PREFACE


In a year filled with numerous national and international meetings, the 
response to the call for papers and the attendance at the 1977 International 
Conference on Parallel Processing have been extremely rewarding for its 
organizers. This conference, the sixth if one includes the Sagamore Computer 
Conferences to which it has succeeded, is now regarded as a regular annual 
event. The 1977 conference, like its predecessor, had the formal support of
the IEEE Computer Society, which is handling the production and distribution of
these Proceedings, and of the Association for Computing Machinery.

This year more than 80 papers were submitted with prospective authors 
coming from 9 countries. Each paper was refereed by at least two referees. 
The 96 individuals who made this possible are listed at the end of these 
proceedings and I would like to thank them personally for a job well done. 
Special thanks are also due to Dr. U. Herzog who served as a liaison with 
some European contributors, Mr. M. Kesselman who volunteered to organize a 
panel on multiple-microprocessor systems, Mr. J. McKay who set up a session
on PEPE, and Dr. M. Freeman and TCCA who helped in organizing and refereeing 
papers for a session on Computer Architecture. 

I think that the participants at the Conference will join me
in congratulating Dr. Charles Elliott and his staff at Wayne
State University on the impeccable local arrangements. I would
also like to thank Ms. Marcia Riedel at the University of Washington for her 
help. 

Last, but certainly not least, we owe a great debt of gratitude to 
Professor Tse-yun Feng. Dr. Feng, who originated and organized the first
four conferences, was General Chairman of the 1977 ICPP. With him as a
constant driving force, we can look forward with great anticipation to next
year's meeting.


-Jean-Loup Baer 


1977 ICPP Program Chairman 
PERSPECTIVES OF PARALLEL PROCESSING


Franklin H. Westervelt


Director, Computing Services Center 
Professor of Computer Science 
College of Liberal Arts 
Professor of Engineering 
College of Engineering 


Wayne State University 
Detroit, Michigan 48202 


Keynote Address
1977 International Conference
on Parallel Processing


Over the past years, the keynote 
address for the International Conference 
on Parallel Processing has attempted to 
present one or another view of Parallel 
Processing and, in so doing, provide a 
point of beginning and a challenge for the 
conference and its work. Each view of 
parallel processing tends to see and to 
emphasize certain aspects of parallel 
processing and its state of development. 
In a very real sense, each view provides 
another perspective of parallel 
processing. 


It has been observed that the "real 
world" is the union of an unbounded finite 
number of "unreal worlds". Each unreal 
world is a model or perspective of the 
real world, or some aspect of it, held by 
a particular observer. Occasionally, many 
observers share a common perspective, at 
least for a time, and cause a particular
view to acquire a certain popularity and
acceptance. But it is of considerable
importance for each of us to develop
flexibility and adaptiveness so that we
may recognize and appreciate the
contributions provided for us by other
views or perspectives. The best
perspectives recognize a great many
features and fine structure and, in so
doing, tend to provide unification and
understanding of complex subjects, but it
is also important to remember the
difficulty inherent in viewing any
complex, multidimensional subject from a
finite number of perspectives, let alone
from a single point, and thereby obtaining
an adequate picture of the subject.


This address will provide yet another 
perspective of parallel processing. But 
the point of view is, hopefully, different 
enough that some may discern new features 
and new challenges. 


Parallelism in computation machinery
has been recognized and incorporated from
the very beginning. Circuitry to provide,
for example, parallel addition appeared
almost concurrently with serial circuits
for the same functional purpose.
Designers have always recognized the
improvement in speed and performance to be
gained through parallelism. The ability
of each member of the audience to carry
on, at this very moment, extremely
difficult feats of audio and visual
pattern recognition and interpretation
very easily while employing receptors and
information processors whose performance
specifications are comparatively
pedestrian is possible only because of
quite incredible parallel processing
inherent in each one of us. The human
being is, indeed, a most remarkable 80 kg
non-linear parallel processing
servomechanism capable of mass production
by unskilled labor.


The individual logic devices which 
together comprise the human parallel 
processor can each be put to shame in many 
ways by components already state-of-the- 
art in contemporary computing systems. 
Yet we remain an enormous distance away
from being able to assemble, package and
power a parallel processor of like
complexity and generality. It is
interesting to note the anthropomorphic
inspiration present in recent research on
optical pattern recognition systems and
interconnected cellular automata. Man has
always derived great benefit from the 
study and modeling of existing systems and 
from using these studies and models to 
enhance and improve upon various aspects 
or features of them. Parallel Processing 
research should be found to be no 
different in this respect. 


With these very general remarks as
background, let me move toward the
presentation of my particular personal
perspective of parallel processing. Here
my experience in providing systems for all
aspects of computational services in
higher education at two major United
States universities must necessarily bias 
my point of view. But I believe that my 
perspective may be of sufficient general 
interest on a larger scale to merit your 
consideration. At the close of this 
address, I hope that we share in a mutual 
exchange of question and comment springing 
from these remarks. 


As we are all well aware, the world 
presently faces an energy crisis. But it 
is more nearly correct to recognize the 
crisis to be in the consumption of certain 
particular forms of fuels. In other 
words, the problem lies in the most 
appropriate use of raw materials as their 
finite supply decreases and the cost of 
acquiring them increases. The complex 
long chain hydrocarbons present in fossil 
fuels are a resource of chemical building 
blocks that is much too valuable to be 
simply oxidized by burning. I must 
believe that the descendants of our 
children's children will be most critical 
of our generation for having squandered 
and destroyed these complex molecules in 
such enormous quantities by burning them. 
Yet the world need for abundant energy in 
order to provide an adequate food supply 
and general living standards for its 
burgeoning population must be met. 


Because the energy needs must be met 
if we are to survive and continue as a 
civilization, I will make no further
comment on this situation at this time. I
shall assume the solution for the energy
supply problem and focus my attention on 
another longer range problem. In the 
longer run, the obviously finite size of 
the planet Earth and the material 
resources available to it within its 
reasonable sphere of acquisition will, in 
my opinion, result in the problems of most 
efficient and effective use of all natural 
resources becoming the overriding concern 
for all people. Materials may become too 
valuable and costly to permit anything but 
highly automated plants and machines to
handle, mold, cut, shape and form them
into products for our use. Reduction of
the waste of materials by the progressive
elimination of the human error factor in
manufacture and production will come to be
a dominant objective achievable through
increased application of automation. Many
products, particularly in the computer
field, can be made at all, even today,
only because of complex,
highly automated machines which operate
with very minimal human intervention. The
concurrent development and application of
information processing in its most general
sense and the consequent impact on further
development of parallel processing
methodology follow immediately in the
industrial and commercial arena.


But the systematic reduction of human 
error in manufacture and production will 
carry with it another effect. It has been
observed that, in any reasonably complex 
system, there is no such thing as a change 
that produces only a single effect. As 
the use of automation increases, the 
amount of conventional work performed by 
humans in manufacture and production will 
decrease. Leisure time growth today is 
viewed as positive by many who may have 
had their time occupied to too great an 
extent in the past by conventional work. 
But society does not yet compensate
leisure in any general sense and for some
persons bypassed by technology "leisure
time" may be only another term for
"unemployment".

Unemployment is a problem of concern
when national levels are in the 5% to 10%
range. But I can tell you from recent
personal experience in the Detroit area
where unemployment in some segments of the
population reached 50% or more during the
recent recession that "concern" is simply
not an adequate term to apply to such a
problem. Consider then what the situation 
might be if conventional work decreased 
such that very high levels of unemployment 
became common on a world scale. Each of 
you may construct your own image of such a 
world. 


The world faces a dilemma: On the one 
hand, the Puritan work ethic will tend to 
decline to compensate leisure while, on 
the other hand, scarcity of material 
resources will cause conventional work to 
decline as well. A solution for this 
apparent dilemma will, in my opinion, come 
from a redefinition of "conventional work"
and from a shift in human activities from
those that are intensive in the
consumption of non-renewable materials
toward activities that will tend to be
energy-intensive, and in many cases nearly
energy-exclusive. In other words, energy
will tend to become the one resource that 
humans will, in general, be permitted to 
use and consume in significant amounts 
because it will be the most easily 
replenished resource. 


Consider some of the kinds of
energy-intensive activity implied by such a world
situation. "Work" may be structured in
terms of interaction at many different
levels of intellectual capability and
skill using the technology of extremely
advanced communication, simulation and
computation to enable persons to learn
complicated new skills and to be
compensated for doing so. In the process
of such learning and development, actual
materials would be consumed very sparingly
while the process may consume significant
energy in order to be carried out
properly.

If this should seem too farfetched,
consider only a few of the things that we
are presently doing of this nature. We
are all aware of the elaborate simulations
used in the development of skills needed
by the Astronauts and Cosmonauts. When
the first Astronaut actually stepped upon
the surface of the Moon, after consuming
enormous quantities of real natural
materials in order to get there, his pulse
rate, respiration and blood pressure
showed no indication of his awareness of
being in surroundings that for the rest of
us must still be regarded as fantastic!
Of course not: this human being had
already "been there before" many times
through simulation that was incredibly
"real" and which, by comparison, consumed
almost no natural resources. Furthermore,
this Astronaut was among many who
experienced the same training through
simulation and were paid to do it.


Airline pilots and the captains of
supertankers are also examples of skill
development and learning through the use
of sophisticated simulations. I need not
relate to this audience the critical role
played by computer technology in these
cases. I should only like to point out
that many much less sophisticated examples
exist where compensation has been given to
those learning or developing new skills or
capabilities. The learning of a foreign
language while in military service is a
most familiar example. The extension and 
refinement of these and similar examples 
is, perhaps, the mechanism by which "work" 
in the future may come to be redefined. 


If something similar to this should
come about, it will require significant
new developments in parallel processing
for general purpose computation in
addition to the special purpose forms that
receive most of our attention today. It
is of great importance that the necessary
research and development of large general
purpose parallel processors be funded now.


To develop my perspectives of
parallel processing further, I should now
like to turn my attention toward a much
closer but highly related problem. In a
very strong sense, both the foregoing
problem and the one upon which I now focus
are related to education and learning.
Education, in general, and Higher
Education, in particular, is an extremely
labor intensive business today. For many,
if not most, colleges and universities the
fraction of General Fund Budget committed
to salaries and wages is 70% to 80% or, in
some cases, more.


As a result of declining numbers of
students in the primary and secondary
schools and general inflationary pressures
on salaries and wages, colleges and
universities face the very serious
prospect of "pricing themselves out of
business" in the next decade. Tuition is
already at a level that tends to require
one or more forms of student financial
assistance, even for students from
families nominally considered to be
well-to-do. For less advantaged students,
higher education in the absence of
substantial student financial aid is
already priced beyond reach.


In order to preserve or, better, 
improve the quality of higher education 
while holding or, better, reducing the 
cost, means for improving the productivity 
of the system must be found. Other 
business and industry faced with the same 
sort of problem turned to technology for 
help in solving it. Unfortunately, much 
of the technology relevant to industry is 
not relevant to higher education in trying 
to improve productivity. 


There are, however, two general
technologies with considerable relevance
to this problem. One, the general
"broadcast" technology, provides many ways
to improve upon the dissemination of
information in the "one-to-many" mode.
Audio-visual techniques, including the
entire scope from films and tapes to
video, all extend the audience of a given
educator and tend to reduce the unit or
per-student cost of conveying the
particular information or lesson. In
general, the more effectively the
broadcast technology expands the audience
size, however, the less effectively does
the technology accommodate to the
individual needs of particular students.
In other words, the "unfair advantage"
that distinguishes the university or
college from the correspondence school,
student-teacher interaction, tends to be
seriously impaired. And with this loss of
interaction, the quality of the
educational process is also impaired.


Again we face a dilemma: it appears 
essential to improve the productivity of 
higher education in terms of numbers of 
students per person engaged in the
process, yet it is the interaction or 
feedback of the one-on-one educational 
experience that characterizes the finest 
aspects of that process. 


The broad field of computer
technology is the other technology
relevant and uniquely suited to assist in
the resolution of this dilemma. Where the
broadcast technologies tend toward simplex
communications channels, computer
technologies have emphasized duplex
communications in many relevant forms.
The essential contribution is the
provision of a "many-to-one" technology to
provide more efficient and economical
interaction and feedback for use in higher
education.


The simplest examples of currently 
available techniques are little more than
conventional store-and-forward 
communications systems. Computer 
conferencing or asynchronous conferencing 
comes much closer to the level of 
technological assistance required for the 
solution of the quality/quantity versus 
unit cost dilemma of higher education. 


A great deal can be done with 
existing computer systems in this area. 
But to reach the levels of reliability and 
generality required to really solve the 
problem, we do not yet have the computer
systems available with general purpose
characteristics and the configurability
necessary to deliver the appropriate
computational power to a very large number
of dynamically created and changing tasks. 
Parallel processing research and 
development holds the promise of making 
the required systems available. 


I should like to take a moment to 
describe a little of the work now being 
done at Wayne State University which, I
hope, may be relevant to parts of the
solution for the foregoing problem. Let 
me first give you a brief picture of the 
university itself. 


Wayne State University is a major
urban university with a number of rather
unique characteristics. At any given
moment, Wayne State University is an
active community of about 40,000 or more
students and faculty. But the momentarily
active student body of 30,000 to 35,000 is
drawn dynamically by personal
circumstances of work and study from a
student population admitted to the
university numbering well in excess of
100,000 individuals.


It has been demonstrated, for example 
by Dartmouth, that when sufficient 
computer resources can be made available 
and accessible, nearly 70% to 80% of a 
university community will find the 
resource meaningfully relevant to their 
educational experience. 


But it is a considerable challenge to
try to provide such access using
contemporary systems at an institution
that is an order of magnitude larger than
Dartmouth.


At Wayne State University, we began
over six years ago to acquire the
facilities step by step and to develop a
hierarchical computing system for the
university that might address the problems
of higher education as rapidly and
effectively as resources and technology
would permit. The first decision was to
purchase, in 1971, a dual processor IBM
System/360 Model 67 configured as a full
duplex system. Since then we have been
able to retire the loan used to acquire
that system and to use the system as a
foundation for further system growth and
development. This April, we added an
Amdahl 470V/6 system as a part of the
plan.


The design limitation on main memory
size and the lack of Error Correcting Code
capability in the standard IBM memory
products for the Model 67, as well as the
relatively high cost of IBM memory,
resulted in a contract between Wayne State
University and Fairchild Memory Systems.
Under this contract, the parties combined
talents to design a bipolar semiconductor
memory using the same 256x1 TTL 100 ns
memory chips supplied by Fairchild for the
Illiac IV. This memory system has several
interesting features, such as a memory
controller capable of executing
instructions, and it has demonstrated
significantly better performance and
economics. Instead of being limited to a
maximum of 2 Megabytes, we have 4.25
Megabytes today at an average system cost
of about four cents per bit.
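
The scale of that figure is easier to judge as a total. The short sketch below multiplies it out. It assumes a binary megabyte (2^20 bytes) and counts data bits only, ignoring any check bits; neither assumption is stated in the address, and the function name is illustrative.

```python
# Rough total cost implied by "4.25 Megabytes at about four cents per bit".
# Assumptions (not from the address): 1 Megabyte = 2**20 bytes, and only
# data bits are counted (no error-correcting check bits).
BYTES_PER_MEGABYTE = 2 ** 20
BITS_PER_BYTE = 8


def memory_cost_dollars(megabytes, cents_per_bit):
    """Return the total cost in dollars for a memory of the given size."""
    bits = megabytes * BYTES_PER_MEGABYTE * BITS_PER_BYTE
    return bits * cents_per_bit / 100.0


total = memory_cost_dollars(4.25, 4.0)
print(f"Approximate memory cost: ${total:,.0f}")
```

Under these assumptions the memory represents roughly 1.4 million dollars.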

The Model 67 duplex architecture 
remains unique in the IBM family and is 
generally very poorly understood. This 
architecture provides features to enhance 
the parallel processing carried on by the 
processors and channel controllers. These 
features have been exploited in the MTS 
(Michigan Terminal System) implementation. 


While best known for its Address
Translation hardware, the Model 67 bus
structure providing for up to eight
processors or channel controllers to share
main memory and its extremely flexible
configuration capability features are at
least as important and significant. Quite
unlike the more common MP systems produced
by IBM, the duplex Model 67 provides
system symmetry and consequent
simplification of system design. In MTS,
for example, the only lack of complete
homogeneity among the processors is in the
keeping of Time of Day. Since an
independent clock for that purpose was not
a part of the system hardware, this task
is uniquely assigned to whichever
processor was IPLed first. Otherwise the
processors are treated in a completely
homogeneous fashion. The MTS software is
designed to support the maximum of four
processors and four channel controllers
supported by the bus. IBM never built a
maximum configuration to my knowledge, and
only one triplex, which was never
delivered to a customer. It is most
unfortunate that the features of the Model
67 were never carried forward into later
systems by IBM.


As but a single example of the
importance and utility of these features
when combined with appropriate operating
system software, a soon to be released
paper by Professor R. J. Srodawa of the
Wayne State University Computer Science
faculty reports and discusses the
achievement of dual processor throughput
more than double that of the single
processor case. The literature commonly
cites factors of 1.5 to 1.8 for such
systems. While there is need for more
experimentation and modeling, these
results indicate that it is possible to
attain significantly better systems
performance than has generally been
reached and reported elsewhere. It is
also important to recognize the existence
of practical cases in which a two
processor duplex system can produce more
than twice the throughput of a single
processor system with the same memory
size. Such a result is by no means a
contradiction of the Second Law of
Thermodynamics. There are many reasons
why such a result is attained in this
case. Clearly these results require both
hardware which is designed with special
attention paid to issues of symmetry, lock
contention, storage contention and
inter-processor communication as well as an
operating system designed with special
attention to these same issues and
including design features that do not
double or more than double system overhead
in going from one to two processors. The
fundamental work of Alexander et al. at the
University of Michigan in the design of
MTS should be much better and more widely
understood.
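
The throughput factors quoted above can be put on a common scale by dividing each by the number of processors. The sketch below does this for the two literature figures mentioned (1.5 and 1.8) and for the break-even factor of 2.0. The function name and the per-processor "efficiency" measure are illustrative choices, not terms used in the address, and no measured value from the Srodawa paper is assumed.

```python
# Per-processor efficiency implied by a multiprocessor throughput factor.
# The factors 1.5 and 1.8 are the literature figures quoted in the address;
# 2.0 is the break-even point for two processors. Names are illustrative.

def efficiency(throughput_factor, processors):
    """Throughput factor divided by processor count (1.0 = linear scaling)."""
    if processors < 1:
        raise ValueError("processors must be at least 1")
    return throughput_factor / processors


for factor in (1.5, 1.8, 2.0):
    print(f"factor {factor:.1f} on 2 processors -> "
          f"efficiency {efficiency(factor, 2):.2f}")
```

Any dual processor factor above 2.0 gives an efficiency above 1.0, which is the case the address describes as exceeding twice the single processor throughput.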


In April of this year, a 4 Megabyte
Amdahl 470V/6 was added to our duplex
Model 67 configuration. This well known
pipelined machine was installed so quickly
that our own site preparation delay in
obtaining 400 Hz power held up initial
operation by two days. Since power became
available, we have been extremely pleased
with the reliability and performance of
this system. The Amdahl is being run
under the VM (Virtual Machine) operating
system in order to accomplish the
significant system work needed to
interconnect the 67s and the Amdahl using
two CCAs (Channel-to-Channel Adapters) in
a full-duplex communications protocol.
When ready for use in this mode, the 67s
will act as "frontend processors" for the
Amdahl and will enable concurrent support
for a large number of interactive
terminals performing relatively small,
quick response tasks and for a smaller
number of large, more demanding tasks with
slower response. The 67s are further
"frontended" by intelligent terminal
controllers to provide flexible and prompt
communications response. One such
controller, based on a PDP-11, serves as
the communications controller for up to 32
terminals and the MERIT computer network
linking our facility at Wayne State
University to TELENET and to the CDC 6400
at Michigan State University and the
Amdahl 470V/6 at the University of
Michigan. The goal for this system is to
form a hierarchical system capable of
providing good service for 400 to 500
concurrent general purpose timesharing
lines or users. These users presently
connect a wide variety of remote terminals
to our system ranging from "dumb"
typewriter or CRT devices to quite
"intelligent" micro- or mini-based
graphics and laboratory systems.


The objective is to provide very 
economical access to a computational 
resource able to provide a "match" for a 
given problem with the computing power 
necessary for effective interaction. And 
to accomplish this in a user-transparent
manner and on a scale consistent with the
size of our university community of users.


We see a great many challenging
problems in various aspects of parallel
processing to be solved in order to
achieve our goals and objectives. We
believe that dealing with these problems
in a real environment of demanding users
will cause us to seek out and implement
solutions that will contribute to the
understanding necessary for improved
future systems.


One characteristic of our system that 
should be clear to all is its combination 
of processors of a wide range of size, 
bandwidth and capability. I am frequently 
amused and sometimes annoyed by the 
various proponents and protagonists in the 
mini- vs midi- vs maxi-computer system 
arguments. 


I have held that we, as computer 
people, have yet to build anything but 
minicomputer systems. Until we actually 
build a real maxi-computer, I believe that 
we have no basis for such arguments. 


To illustrate my point, several years
ago I served on the Board of the Argonne
Universities Association, the governance
body for the Argonne National Laboratory.
The Laboratory had just acquired an IBM
195 system at a cost of some $10 to $12
million. Clearly a system that most might
feel to be a maxi-computer.


On the return flight to Detroit, my
seat companion was another member of the
AUA Board who was also a vice president of
the Detroit Edison Company. Thinking
about the 195 acquisition, I asked him,
"When was the last time that Detroit
Edison acquired a major power generator
for $10 million?" My companion laughed
and said, "Good Heavens! The transformer
substation for the Renaissance Center cost
more than that!" Which is exactly my
point: a single major power generating
station today represents nearly $1 billion
and there are many such installations over
the entire United States, let alone over
the world! On the other hand, while we
talk of computer utilities and
maxi-computers, we have yet to conceive of, let
alone build, any machine of comparable scale.
Until we have designed and built a machine
of such scale, I believe that we are
dealing with mini-computers and networks
of them, no matter what actual mainframe
we may be talking about.


When it is finally decided that a
true maxi-computer should be built for the
first time, it seems clear that parallel
processing must infuse the entire design.
Parallelism to improve speed and
performance, parallelism to improve system
reliability and availability, parallelism
to enable dynamic configuration,
partitioning and assignment of
computational power appropriate to a very
large number of both independent and
interdependent tasks, parallelism to
enable rapid and efficient processing of
very large databases required to meet the
needs of society. The call to this
International Conference on Parallel
Processing seems clear and the future
exciting and challenging.


complementary 


If one places today's point in the 
development of modern computer technology 
on the same time scale as the Industrial
Revolution beginning with James Watt's
Condensing Steam Engine, then we have just
this year seen Robert Fulton's first
commercial steamboat! While we recognize
that the advance of technology tends to be
exponential and that, as a result,
progress on an absolute scale in our time
is much larger than from Watt to Fulton, I
believe that our relative progress in
computer technology as viewed from a
century or two hence will appear to have
been no greater! We have no justification
to feel superior or to fail to press 
forward with maximum effort. 


Depending upon how one keeps the
score, we have moved toward the serial
limit of machine computation by seven to
nine orders of magnitude since Eckert and
Mauchly, and again depending upon who
attempts to establish the serial limit, we
have perhaps five or six orders of
magnitude remaining. Allowing for
exponential effects in
difficulty and technology advance, we
should approach the serial limits rather
closely when we have lapsed again the time
interval already past in computer
technology development. We must continue
to press forward toward the serial limits,
but it becomes quite clear that we must
become increasingly aware of and sensitive
to the vital role of parallelism in the
future of machine computation.
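
As a rough check on this estimate, the sketch below extrapolates the quoted ranges at a constant rate of progress. It assumes a starting date of 1946 (the public unveiling of ENIAC) for "since Eckert and Mauchly", which the address does not state, and it deliberately ignores the increasing difficulty the address allows for. The function name is illustrative.

```python
# Constant-rate extrapolation of progress toward the "serial limit".
# Assumption (not from the address): progress is counted from 1946, the
# year ENIAC was unveiled, to the 1977 conference.
START_YEAR = 1946
CONFERENCE_YEAR = 1977


def years_to_limit(elapsed_years, orders_gained, orders_remaining):
    """Years needed to cover the remaining orders of magnitude at the
    average past rate (orders of magnitude per year)."""
    rate = orders_gained / elapsed_years
    return orders_remaining / rate


elapsed = CONFERENCE_YEAR - START_YEAR
fastest = years_to_limit(elapsed, orders_gained=9, orders_remaining=5)
slowest = years_to_limit(elapsed, orders_gained=7, orders_remaining=6)
print(f"elapsed so far: {elapsed} years")
print(f"constant-rate estimate: {fastest:.1f} to {slowest:.1f} more years")
```

A constant rate gives a somewhat shorter interval than the elapsed one; the estimate in the address, a further interval about equal to the one already past, corresponds to progress slowing as the limit is approached.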


These are some of my perspectives of
parallel processing. I hope that you may
have gained from sharing them with me even
a tiny fraction of the pleasure that it
has been for me to present them to such a
distinguished and important audience. I
want to add my welcome to that of Wayne
State University and the IEEE Computer
Society to the 1977 International
Conference on Parallel Processing and to
the important work ahead of you. Thank
you for your most gracious consideration
and attention.
1. Remarks on Classification Schemes and Formal Systems


Classification schemes, languages, and formal 
systems of all kinds have a considerable influence 
on our thinking. Structures which are inherently 
the subject-matter of a language as well as of 
classification schemes form the basic material of 
what can be expressed in a language or can be
comprehended from its position in a classification
scheme. The same statement seems to be valid for 
formal systems in a more specific sense. Thus the 
tool can be used in the application area for which 
it was created. 


For example the Ricci-Calculus performs this 
role only in the area for which it was created, 
certain areas of physics and partial differential 
equations. Outside this area problems arise for 
which it is not suitable. 


B. Whorf has said that language guides thought
[11] and that therefore language sometimes
prevents the appropriate solution of a problem being
found. We must admit that in many cases a language
(it can be referred to as a calculus or notation)
can be a barrier rather than an aid in solving a
problem. It is also true that a classification
scheme can be a barrier, although it can provide
an insight into the relationships between the
elements of some group.


If such a classification scheme is to be
applied to animals and plants, then the elements are
existing objects and the scheme cannot completely
fail, although the discovery of a new species can
present difficulties in fitting it into an existing
classification scheme. Such a scheme can be called
a taxonomy, since all the species are considered
to be descended from a single species, in
accordance with the biological theory of evolution.


It seems more difficult to create a
classification scheme, or even a taxonomy, for some area
of contemporary technology. It is necessary to
project future advances as well as to place existing
examples in it.


The aim of this paper is to show that some
existing schemes may fail to indicate the right
direction for the development of computer
architecture, as compared with a new and promising
classification scheme introduced in [3], [4]. We
would, however, not claim that the proposed
classification scheme will cover all computer structures
which will arise in the future. We do show that
the proposed scheme does cover several very
interesting structures which cannot be placed at an
appropriate point in the schemes of Flynn [1] and
Feng [2].


The justification of the proposed scheme is 
that it should be useful in classifying structures 
and concepts which will emerge in the next years, 
and be of use to the designers of these structures. 
A further justification of the scheme is that the 
elements of the classification scheme can be com- 
posed and decomposed by operations which are sui- 
table for the purposes of the computer architect. 


2. Contemporary Classification Schemes 


Existing classification schemes differ in the
information on which they are based. For instance
M. Flynn [1] bases his scheme on a 'data stream'
and an 'instruction stream'. By combining these
simple concepts he can classify many of the new
computer structures. In contrast, Feng [2]
emphasises the number of bits which are processed
simultaneously. These schemes are outlined in
sections 2.1 and 2.2 in order to contrast them with
the scheme outlined in chapter 3. In section 2.3
the definitions of multiprocessing proposed by the
American National Standards Institute [5] and by
Enslow [6] are discussed.


2.1 Flynn's Classification 


Flynn proposed in 1966 a classification based
on instruction streams and data streams. In the
conventional Princeton type computer a single data
stream is processed by a single instruction stream.
This is described as SISD (single instruction
single data).


In an array computer such as ILLIAC IV, a
single instruction stream processes many data
streams. Such a computer is known as SIMD (single
instruction multiple data). In ILLIAC IV 64 copies
of the same instruction are executed simultaneously
by 64 arithmetic units. The Goodyear STARAN is
also a SIMD computer. It differs from ILLIAC IV in
many respects, in particular in being an
associative array processor.


MISD is an abbreviation for multiple instruc- 
tion single data. Some authors include various 
types of pipeline computers in this class though 
it is doubtful whether this is appropriate, and it 
is unsatisfactory because it does not distinguish 
between the three kinds of pipelining (see section 
3.3 below). 


MIMD is an abbreviation for multiple instruc- 
tion multiple data. Here multiple processors are 
working on multiple data streams. The simplest case 
is where each processor is executing its own pro- 
gram on its own data. The processors can be con- 
nected via a bus system or can access multi-port 
memory. The classification does not contain any in- 
formation about the type of connection used. 


Flynn's classification is illustrated by fig. 
1, where many contemporary computers can be classi- 
fied by assigning them to one of the four verti- 
ces of a graph. However, the classification does 
not fully satisfy the needs of computer architects 
because it is not fine enough and because the in- 
terpretation of the class MISD is not clear (cf. 
[7]). In the literature many authors restrict them- 
selves to the classes SISD, SIMD, and MIMD. A fur- 
ther difficulty occurs if a computer contains both 
parallelism and pipelining. 
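
The four classes can be made concrete with a small sketch. The Python below is an illustration only: the name `flynn_class` and the idea of passing the two stream counts as integers are assumptions introduced here, not something taken from Flynn's paper. The stream counts in the usage lines are the ones quoted in the text above.

```python
from enum import Enum


class FlynnClass(Enum):
    SISD = "single instruction, single data"
    SIMD = "single instruction, multiple data"
    MISD = "multiple instruction, single data"
    MIMD = "multiple instruction, multiple data"


def flynn_class(instruction_streams: int, data_streams: int) -> FlynnClass:
    """Map two stream counts to one of the four vertices of Flynn's scheme."""
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("stream counts must be at least 1")
    if instruction_streams == 1:
        return FlynnClass.SISD if data_streams == 1 else FlynnClass.SIMD
    return FlynnClass.MISD if data_streams == 1 else FlynnClass.MIMD


print(flynn_class(1, 1).name)    # conventional Princeton type computer
print(flynn_class(1, 64).name)   # ILLIAC IV: 64 arithmetic units
print(flynn_class(16, 16).name)  # 16 processors, each with own program and data
```

Only the distinction "one" versus "more than one" survives in the result. Any machine with a single instruction stream and several data streams lands on the same vertex, whatever its internal organisation, which is the coarseness criticised above.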


2.2 Feng's classification 


Feng [2] classifies according to the
wordlength, i.e. the number of bits which are processed
in parallel in a word, and the number of words
which are processed in parallel. A computer struc- 
ture is represented by a point in a plane (fig.2) 
where the abscissa is the wordlength (normally 12, 
16, 24, 32, 48, 60 or 64), and the ordinate is the 
number of words processed in parallel. The latter 
can be determined by the number of processors. For 
example C.mmp which contains 16 PDP-11's with word- 
length 16 bits is represented by (16,16). The ordi- 
nate can also be determined by the number of arith- 
metic and logical units in an array processor. 

Thus ILLIAC IV is represented by (64,64). 


Thus Feng's classification does not allow one to
distinguish between multiprocessors like C.mmp and
array processors. This caused Enslow [7] to
represent C.mmp in "gang" mode by (16,256). But C.mmp
in gang mode can be regarded as similar to ILLIAC
IV, with 16 ALU's executing a single program,
which would give the point (16,16), which is the
same as when gang mode is not used. The
classification also does not distinguish between
autonomous processors which execute programs and ALU's
which execute operations, i.e. it does not
distinguish between processing levels.


The TIASC (Texas Instruments Advanced
Scientific Computer) is represented as (64,2048). The
number 2048 is obtained from the 4 pipelines each
consisting of 8 stages with 64 bits. However the
number 2048 can be obtained in many ways, e.g. 8
pipelines, 8 stages, 32 bits. Thus the
classification cannot represent a multiple pipeline structure
like the TIASC accurately.
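
The ambiguity can be checked with a line of arithmetic. The following Python sketch is illustrative only; the function name `simultaneous_bits` and its parameter names are assumptions made here, and the product rule is simply the counting used in the paragraph above.

```python
def simultaneous_bits(pipelines: int, stages: int, bits_per_stage: int) -> int:
    """Number of bits in flight at once, counted as in the TIASC example."""
    return pipelines * stages * bits_per_stage


tiasc = simultaneous_bits(pipelines=4, stages=8, bits_per_stage=64)
other = simultaneous_bits(pipelines=8, stages=8, bits_per_stage=32)

# Two different structures collapse to the same number.
print(tiasc, other, tiasc == other)
```

Since the product discards its factors, the number of pipelines and the number of stages cannot be recovered from the coordinate alone.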


It is also not possible to represent the
pipeline structure at the program level of PEPE. PEPE
is characterized as (32,16), and the fact that each
set of data (up to 288, each representing a flying
object) is processed successively in three
different ways is not represented. This is performed in
three separate series of ALU's, and we can regard
this as a three stage macropipeline (cf. section
3.3).


The lack of a rigorous definition of pipelin- 
ing in the context of Feng's classification scheme 
leads to difficulties in classifying structures 
containing both pipelining and parallelism. Thus 
the scheme is not entirely satisfactory for the 
computer architect either.


2.3 Definition of Multiprocessing 


Similarly to classification schemes, if defi- 
nitions are too narrow, some viable computer struc- 
tures may be excluded from consideration. 


The American National Standards Institute [5] 
defines a multiprocessor as: 


"A computer employing two or more processing 
units under integrated control." Manufacturers of 
systems containing two to four processors did not 
find themselves in conflict with this definition. 
The definition did not exclude future developments 
in computer architecture, but does not seem to have 
had any impact on contemporary architecture. Sub- 
sequently Enslow suggested a more detailed defini- 
tion in his excellent book [6] which included 


1. two or more processors, having access to
a common memory, whereby private memory is not
excluded,


2. shared I/O,


3. a single integrated operating system,


4. hardware and software interactions at all
levels,


5. the execution of a job must be possible on
different processors,


6. hardware interrupts.


We will concentrate on the first
characteristic:


A common memory is mandatory. Such a structure 
is shown in fig. 3. It is easily seen that as the 
number of processors increases the congestion in 
the access to the common memory will also increase. 
Thus Enslow's definition seems to exclude systems 
containing very large numbers of processors. Micro- 
processors costing a few dollars are now available, 
so that systems containing thousands of processors 
are now possible. Some of the more progressive
projects of computer architecture such as PRIME [9]
are also excluded. On the other hand some
structures which satisfy Enslow's definition are subject
to severe limitations on their expandability and
application due to their use of an expensive
crossbar switch [10].


Thus Enslow's definition does not satisfy
the requirements of contemporary computer
architecture either.


2.4 The Influence of Classification 
Schemes and Definitions 


We have tried to show in the previous
sections that definitions and classification schemes
have their limitations and can prove a hindrance
beyond a certain point. The computer architect
should recognize when this point has been reached,
and consider whether an entirely new
classification scheme or definition is needed, which will
ideally include all existing structures within a
particular area and also all structures which will
be considered in this area in the future. There is
no doubt that one should consider very carefully
the consequences of introducing a new
classification, because of its possible educational and
normative effects.


3. The Erlangen Classification Scheme 


3.1 Introduction 


The Erlangen classification scheme (ECS) was 
developed mainly in order to avoid the drawbacks 
of existing classification schemes, as outlined in 
section 2. 


The basic requirements are 


1. the objects to be classified should not be
unnecessarily restricted. Any kind of computer
system - in particular parallel processors, array
processors, multiprocessors, pipeline processors -
must be classifiable in the scheme;


2. the classification must be sufficiently 
fine to express those differences between the ob- 
jects considered important; 


3. the classification must be unambiguous. 


The classification scheme developed was also 
found to be a useful technique in computer archi- 
tecture, in the sense that: 


4. Composed computer configurations can be 
described by using operators which are applied 
to primitive elements of the scheme. 


5. It can be used in evaluating architectural
configurations, in particular with reference to
cost.


6. It provides a measure for the flexibility 
of a system. 


7. It provides a starting point for scheduling
of flexible structures.


The objects of the classification are not ne- 
cessarily computers only. This will be amplified 
below. The flexibility mentioned in 6. above is 
connected with the fact that a computer can be re- 
presented by more than one point in the classifi- 
cation. The various points which represent a com- 
puter will be referred to as modes. The more modes 
a computer has, the more choice of mode it has for 
a particular application, and so the greater is 
its flexibility. 


The classification scheme can be used for al- 
gorithms as well as for computers, and demon- 
strates the inherent partitioning of the algorithm 
into parallel sections and pipeline stages. The 
classification of algorithms must then be related 
to the classification of the computers on which 
they are to be run. In general, jobs must be in- 
vestigated to identify the classes of the algo- 
rithms contained, and matched to the classes of 
the computers on which they are to be run. A more 
detailed discussion of this question will be 
given in another paper. 


3.2 Parallelism 


Our classification aims at characterizing 
the parallelism and pipelining present in a com- 
puter system. The connections between the pro- 
cessors and the memory blocks are not included 
in the classification. It is assumed that the con- 
nections can carry the expected traffic and pro- 
vide the required availability. In such a case 
the performance of the system is mainly determined 
by the processors, including their capability to 
transfer information. 


The classification is based on the distinc- 
tion between three processing levels: 


1. Program control unit - Using a program 
counter and some other registers, and, in most 
cases, a microprogram device, the PCU interprets 
a program instruction by instruction. 


2. Arithmetic and logical unit - The ALU uses 
the output signals of a microprogram device to 
execute sequences of microinstructions according 
to the interpretation process performed by the PCU. 


3. Elementary logic circuit - Each of the 
microoperations which make up the microoperation 
set initiates an elementary switching process. The 
logic circuits belonging to one bit position of 
all the microoperations are called an ELC. 


A computer configuration can include a number
of PCU's. Each PCU can control a number of ALU's
all of which perform the same operation at any
given time. Finally, each ALU contains a number
of ELC's, each dedicated to one bit position. The
number of ELC's is commonly known as the
wordlength.


If we disregard pipelining for the moment,
the number of PCU's, ALU's per PCU, and ELC's per
ALU form a triple, written


t(computer type) = (k, d, w).


We give some examples of the triple, where we 
assume that the reader is familiar with at least 
some of the computers: 


t(MINIMA) = (1,1,1) 


The "classical" serial computer. Some early 
European computers were of this form. 


t(IBM701) = (1,1,36) 


An example of the early "parallel" (on the 
3rd level) Princeton computers. 


t(SOLOMON) = (1,1024,1)


The historical concept of an array processor.


t(ILLIAC IV) = (1,64,64)


The famous array processor developed at the
University of Illinois (without PDP 10).


t(STARAN) = (1,8192,1) 


The well-known associative array processor 
(without host and sequential control processor) 
fully extended (32 frames of 256 bits each). 


t(C.mmp) = (16,1,16) 


The Carnegie-Mellon University multi-mini
project using 16 PDP-11's.


t(PRIME) = (5,1,16) 


The University of California, Berkeley,
project in which time-sharing is replaced by
multiprocessing.


The different systems exhibit different kinds 
of parallelism, which is uniquely attached to one 
of the three levels. The numbers which make up the 
triple show this directly. 
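
As a minimal sketch of how the triple can be handled mechanically, the Python below stores the examples listed above and reports the levels at which each one is parallel. The type name `Triple`, the dictionary `EXAMPLES` and the helper `parallel_levels` are names chosen for this illustration and are not part of ECS; the numeric values are copied from the text.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    k: int  # number of PCU's
    d: int  # number of ALU's per PCU
    w: int  # number of ELC's per ALU (wordlength)


# Values copied from the examples in the text.
EXAMPLES = {
    "MINIMA": Triple(1, 1, 1),
    "IBM701": Triple(1, 1, 36),
    "SOLOMON": Triple(1, 1024, 1),
    "ILLIAC IV": Triple(1, 64, 64),
    "STARAN": Triple(1, 8192, 1),
    "C.mmp": Triple(16, 1, 16),
    "PRIME": Triple(5, 1, 16),
}


def parallel_levels(t: Triple) -> list[str]:
    """Names of the levels at which the triple shows parallelism (count > 1)."""
    return [name for name, count in zip(("PCU", "ALU", "ELC"), t) if count > 1]


for name, t in EXAMPLES.items():
    print(f"t({name}) = ({t.k},{t.d},{t.w})  parallel at: {parallel_levels(t)}")
```

Read this way, C.mmp and ILLIAC IV are separated at once: the former is parallel at the PCU level, the latter at the ALU level.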


At first sight, the triples are able to clas- 
sify all viable structures, particularly in regard 
to parallelism. But although parallelism is the 
most important phenomenon in contemporary computer 
architecture, pipelining must also be considered. 
The examples above exhibit parallelism but not 
pipelining. In the next section the classification 
is extended to include pipelining. 


3.3 Pipelining 


Pipelining can also be implemented at the 
three levels described in section 3.2, i.e. 
1. PCU, 2. ALU, and 3. ELC. 


For example level 3 pipelining is the
well-known pipelining of the arithmetic unit used in
the CD STAR-100 and the TIASC. The STAR-100 uses
a four stage pipeline and the TIASC an eight stage
pipeline.


An arithmetical pipeline can be regarded as a
"vertical" replication of ELC's, compared with the
"horizontal" replication used in a parallel
arrangement of ELC's.
It is therefore reasonable to multiply the number
of ELC's, w, by the number of stages in the
pipeline, w', to characterize the ALU. For the TIASC
we have then


t(TIASC) = (1,4,64x8). 


The multiplication sign will be used at all
levels to separate the number representing the
degree of parallelism from the number representing
the number of stages in the pipeline.


The next higher level of pipelining is
instruction pipelining. This involves the existence
of a number of function units which can operate
simultaneously to process a single instruction
stream. It is based on the inspection of
instructions prior to execution to identify those
instructions which can be executed simultaneously without
conflict. This is done by a scoreboard, in the
terminology of Control Data. These instructions are
executed as soon as a suitable function unit is
free. This technique is referred to as "instruction
lookahead", "instruction pipelining", or
"parallelism of function units".


A classical example of this kind is the CD 
6600 computer. Disregarding for the moment the in- 
put-output section (i.e. the peripheral processors), 
the internal structure of CD 6600 with 10 function 
units becomes: 


t(CD 6600 central proc.) = (1,1x10,60). 


The 10 units in this case are highly
specialized (e.g. floating point multiplication, integer
addition, incrementation, etc.) and therefore a
gain of a factor of 10 cannot be achieved. The real
factor depends on the particular program actually
running. An average of 2.6 is a typical figure
according to information available from Control Data.
A combination of several function units of the same
type seems to be quite reasonable regarding the
better utilization of equipment on the one hand
and the now available large-scale integration
technology on the other hand. These latter
considerations nevertheless are not directly a subject of
this paper.


Finally, we have to consider the pipelining
concept of level 1, which is so far not very well known.
This concept can be called "macro-pipelining" [12].
Assuming that a data set has to be processed by
two different tasks sequentially, then this can be
performed in two different processors, each one
processing one task. The data stream then passes
the first processor (first task), is stored in a
memory block, to which the second processor also has
access, and will then pass the second processor
(second task). Since both processors can work at the
same time (on different data), the effective
processing speed can in an ideal case be doubled in
comparison with the use of only one processor.
In such a way, stepping from processor to processor,
data are 'refined' [12] on one hand or are
'integrated' [13, 8] in the case of ordinary
differential equations on the other hand.
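
The ideal doubling can be illustrated with a simple timing model. The Python below is a sketch under assumptions that are not stated in the paper: each task takes a fixed time per data set, the shared memory block buffers results without delay, and transfer costs are ignored. The function names are hypothetical.

```python
def serial_time(n_sets: int, t1: float, t2: float) -> float:
    """One processor performs both tasks on every data set."""
    return n_sets * (t1 + t2)


def macropipeline_time(n_sets: int, t1: float, t2: float) -> float:
    """Two processors: the first performs task 1, the second task 2.

    The first finished data set appears after t1 + t2; afterwards the
    slower of the two stages delivers one finished data set per step.
    """
    if n_sets == 0:
        return 0.0
    return t1 + t2 + (n_sets - 1) * max(t1, t2)


n = 100
speedup = serial_time(n, 1.0, 1.0) / macropipeline_time(n, 1.0, 1.0)
print(round(speedup, 2))  # close to 2 for balanced tasks and many data sets
```

In this model the factor of two is reached only when both tasks take the same time and the stream of data sets is long. With unequal task times the slower processor determines the rate.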


The PEPE array (without the host installation)
then is characterized as


t(PEPE) = (1x3,288,32) (3-fold macropipelining).


Summarizing now, the triple has been extended
to a sextuple to incorporate pipelining.
Nevertheless, we keep calling it a triple because the three
levels of consideration (as introduced in 3.2)
suggest that we think in three terms, which have to be
extended in some cases by an additional term,
attached to the other value (of the same level) by
using the sign x.


The triple now reads as follows:


t = (k x k', d x d', w x w')


where

k  = number of PCU's in parallel (multi-processor)
k' = number of PCU's in pipelining (macro-pipelining)
d  = number of ALU's in parallel (array computer)
d' = number of ALU's in pipelining (instruction pipelining, lookahead)
w  = number of ELC's in parallel (wordlength)
w' = number of ELC's in pipelining (arithmetic pipelining)


All entities are independent of one another. 
All combinations therefore can appear. 


Regarding the 'completeness' we claimed in 
Section 3.1, we would have to prove that, apart 
from the three levels mentioned in section 3.2, no 
essential other level can be defined, and that 
there are also no phenomena apart from parallelism 
and pipelining. This is not pointed out in detail 
here, because this paper centers on another point, 
the impact of classification schemes on computer 
architecture. But there is some evidence regarding
the completeness of our classification. While the
details of how level 2 pipelining is designed may
vary, there are no doubts about
the other levels. With respect to parallelism and
pipelining there is an exclusive duality, as is
known from other fields of science where
parallelism and serialism also appear.


Regarding the triple notation, we introduced 
the following simplifications: 


k=1, or k'=1, or d=1 etc. mean, respectively, the
simple cases in which no parallelism or
pipelining appears; we then write


(1xk',dxd',wxw') = (xk',dxd',wxw')    if k' ≠ 1
(kx1,dxd',wxw')  = (k,dxd',wxw')
(kxk',1xd',wxw') = (kxk',xd',wxw')    if d' ≠ 1
(kxk',dx1,wxw')  = (kxk',d,wxw')
(kxk',dxd',1xw') = (kxk',dxd',xw')    if w' ≠ 1
(kxk',dxd',wx1)  = (kxk',dxd',w)


If there is any form of pipelining then the 
character x is preserved in the corresponding le- 
vel. In the case of no pipelining the triple de- 
generates to 


t(MODEL) = (k,d,w). 


This convention contributes considerably to the
clarity and transparency of the notation.
Therefore we will use this convention in
the following.
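
The writing convention can be stated as a small formatting routine. The Python below is a sketch; the function names `format_level` and `format_triple` and the use of six separate integer arguments are choices made for this illustration only. The three usage lines reproduce characterizations given in section 3.3.

```python
def format_level(parallel: int, stages: int) -> str:
    """Render one level 'n x n'' using the simplifications of the notation."""
    if stages == 1:
        return str(parallel)   # no pipelining: the sign x is dropped
    if parallel == 1:
        return f"x{stages}"    # pipelining only: the leading 1 is dropped
    return f"{parallel}x{stages}"


def format_triple(k: int, k1: int, d: int, d1: int, w: int, w1: int) -> str:
    """Render (k x k', d x d', w x w'); k1, d1, w1 stand for k', d', w'."""
    levels = (format_level(k, k1), format_level(d, d1), format_level(w, w1))
    return "(" + ",".join(levels) + ")"


print(format_triple(1, 1, 4, 1, 64, 8))    # TIASC: (1,4,64x8)
print(format_triple(1, 1, 1, 10, 60, 1))   # CD 6600 central proc.: (1,x10,60)
print(format_triple(1, 3, 288, 1, 32, 1))  # PEPE array: (x3,288,32)
```

The sign x survives in the output exactly at those levels where pipelining is present, so a triple without any x is free of pipelining.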


3.4 Operations on Triples 


As a triple characterizes a computer structure 
of a certain homogeneity, a combination of triples 
connected by an operator can denote 


a) a more complex computer structure
(as given e.g. by a special I/O section
of processors or by a special host,
which are connected to a specific
computer configuration);


b) a selection of operation modes of a
structure, which can be used
alternatively, fitting different needs,
according to the algorithmic nature
of different applications.


It should be noted in connexion with b) that 
for any application there can exist a number of 
algorithms, each one fitting a different computer 
structure. E.g. one algorithm which is a solution 
to a given problem can be highly suited for exe- 
cution on a conventional Princeton type computer, 
while another may be better suited for a parallel 
or pipelining computer. 


The complete structure of the aforementioned
CD 6600 computer then reads, using a multiplication
sign x:


t(CD 6600) = (10,1,12) x (1,x10,60).


The first term on the right hand side of "="
denotes the existence of ten processors of a
simple structure with a wordlength of 12 bits. The
second term is the characterization of the
nucleus of the CD 6600, as it was given earlier. The
multiplication sign visualizes the fact that all
algorithms (programs) must be forwarded through
the peripheral processors first, in order to be
processed then in the central processor (1,x10,60).


Another example of contemporary computer
architecture is PEPE (Parallel Element Processor
Ensemble). Its host is one CD 7600 with the
characteristic


t(CD 7600) = (15,1,12) x (1,x9,60).


PEPE then becomes


t(PEPE) = (15,1,12) x (1,x9,60) x (x3,288,32)


where the last term (x3,288,32) corresponds
to the actual PEPE structure. As, in this example,
a certain flow of information penetrates the three
structures, the sign x is used between the
corresponding terms.


The structures characterized by the primitive 
terms in these examples are very different. There- 
fore a further condensation of the presentation is 
not suggested. A further decomposition can be in- 
dicated, e.g. by the use of other operators, for 
instance in the special case of a CD 7600 by 


(15,1,12) x (1,x9,60) =
[(1,1,12) + (1,1,12) + ... + (1,1,12)] x (1,x9,60)    (15 terms in the sum),


where (n,d,w) = (1,d,w) + (1,d,w) + ... + (1,d,w)    (n terms).


We note that the operators x and + again
reflect parallelism and pipelining in a certain
sense. The last example shows 15 equal processors
allocated in parallel. A given job (or task) will
be forwarded to the central processor. It may
also be necessary to allocate processors serially,
if there are different tasks to be performed one
after another. This is supported by the use of
functionally dedicated processors, specialized to
the respective task.
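
The decomposition rule (n,d,w) = (1,d,w) + ... + (1,d,w) can be sketched as follows. The Python below is illustrative only: the function names are hypothetical, and representing a term as a plain tuple (pipelining components omitted) is an assumption made to keep the example short.

```python
def expand_parallel(n: int, d: int, w: int) -> list[tuple[int, int, int]]:
    """Decompose (n,d,w) into n copies of (1,d,w), to be joined by '+'."""
    return [(1, d, w) for _ in range(n)]


def render_sum(terms: list[tuple[int, int, int]]) -> str:
    """Write the terms as a sum in the notation of the text."""
    return " + ".join(f"({k},{d},{w})" for k, d, w in terms)


def collapse_parallel(terms: list[tuple[int, int, int]]) -> tuple[int, int, int]:
    """Inverse step: n identical terms (1,d,w) become (n,d,w)."""
    if not terms or len(set(terms)) != 1 or terms[0][0] != 1:
        raise ValueError("only identical terms of the form (1,d,w) can be collapsed")
    _, d, w = terms[0]
    return (len(terms), d, w)


peripheral = expand_parallel(15, 1, 12)  # peripheral processors of the CD 7600
print(render_sum(peripheral[:3]) + " + ...")
print(collapse_parallel(peripheral))
```

Expansion followed by collapse returns the original term, so the two ways of writing the peripheral processors describe the same structure.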


The last operator we have proposed so far is
the 'alternative' operator v, which is to be
understood as an 'exclusive or'. For the C.mmp
project, which can be used in three different kinds
of operation modes, the expression becomes:


t(C.mmp) = (16,1,16) v (x16,1,16) v (1,16,16).


Similarly, the EGPA project (4x4 array of
processors, 32 bits each, described e.g. in [13])
reads


t(EGPA 4x4) = (16,1,32) v (x16,1,32) v
(1,16,32) v (1,512,1).


The last term of this expression denotes the 
operation mode "vertical processing" in which the 
16 processors are used, each as if it consisted of 
32 one-bit processors working in parallel. 16 pro- 
cessors then result in an ensemble consisting of 
16x32 one-bit processors. Information then is
oriented to one-bit vertical streams (items) and the
machine-word of the memory becomes what is called
a 'bit-slice' in associative processors.


The operator v visualizes alternatives regar- 
ding the processing modes which can basically be 
used. An extended operator + can be used for a 
further partitioning of a system in which the en- 
semble is working. Scheduling algorithms have to 
be developed which have to centre on the best 
utilization of the system with respect to a given 
set of jobs. The scheduling problems, however, are 
not covered by this paper. 


Yet, a remark on the 'flexibility' should be
added. The number of available processing modes
of a system seems to be a reasonable measure for
its flexibility. Therefore we define
(F = Flexibility):


F(t(MODEL)) =
|(k1 x k1', d1 x d1', w1 x w1') v (k2 x k2', d2 x d2', w2 x w2') v ...|


where | | gives the number of triples connected
by the v sign.


For the examples presented above we have: 
F(t(C.mmp)) = 3 and F(t(EGPA 4x4))= 4. 
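
The flexibility measure amounts to counting the alternatives in an expression. The Python below is a minimal sketch, assuming that an expression is given as a plain string in which the triples are separated by " v "; the function names `modes` and `flexibility` are introduced here for illustration. The two expressions are the ones given for C.mmp and EGPA.

```python
def modes(expression: str) -> list[str]:
    """Split an ECS expression into the triples joined by the v operator."""
    return [part.strip() for part in expression.split(" v ")]


def flexibility(expression: str) -> int:
    """F = number of triples connected by the v sign."""
    return len(modes(expression))


C_MMP = "(16,1,16) v (x16,1,16) v (1,16,16)"
EGPA = "(16,1,32) v (x16,1,32) v (1,16,32) v (1,512,1)"

print(flexibility(C_MMP), flexibility(EGPA))  # 3 4
```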


In this section we wanted to show that a 
classification scheme becomes operable if it is 
carefully chosen. 


Nevertheless, it is not the aim of this
paper to introduce ECS (a) completely. We have used
it as a further example in the discussion about
the 'impact of classification schemes on computer
architecture'.


4. Summary and Outlook 


Some things which can be done with ECS (chap- 
ter 3), cannot be done with any of the systems 
mentioned earlier (chapter 2). Although we do not 
claim that ECS is the only possible classificati- 
on scheme, we have found it useful for evaluating 
computer structures, throughput, flexibility etc. 


In this respect ECS seems, as briefly presen- 
ted here, to be an approach which can become a vi- 
able design tool. It classifies enough objects and 
it does not limit too seriously the set of objects. 


(a) A rigid and more formal presentation of ECS is
under preparation.


The only limitation we perceive so far is the inhe- 
rently binary nature of the definition of w (word- 
length). If a computer is based on another modulo- 
number system, then we would have to slightly modi- 
fy the ECS as presented. 


If, for historical reasons, we have to, for 
example, include the old mechanical calculating ma- 
chines of Ch. Babbage, then it would be necessary 
to extend ECS. Also excluded from ECS are computers 
of the analogue type. But this limitation seems to 
be quite natural in that analogue data processing 
is quite different. 


The only criticism which at this time can be 
made within the aims of this paper could center on 
the number of levels we introduced in chapter 3. 
There we defined a triple according to three pro- 
cessing levels. If perhaps in a later step of evo- 
lution a level above the program interpretation le- 
vel will be created, then we would have to extend 
the triple to a quadruple. 


But just this step to achieve a new level of
computer structure is a real evolutionary step we are
searching for at present. It was exactly for this
that the classification scheme has been developed
as a tool. About such an evolutionary step a
decision cannot be made in advance. It is rather the
ECS classification scheme and the operations
defined on the elements (triples) which seem to be
the appropriate starting point for investigations
of that kind. We hope that ECS will not limit too
narrowly a future development, for it includes all
structures which so far have been proved as viable
examples of computer architecture.


Fig. 3: P.E. Enslow's definition of a
multiprocessor leads to "one common memory block"
(private memory blocks, owned by a
processor exclusively, are not excluded by
the definition).


Fig. 6: Macropipelining (level 1 pipelining)


Abstract -- Program execution in some
processors (analog, array, pipeline, data-flow,
single-assignment, etc.) reflects the structure of
compound operations described in the user program.
However, the original user description of these
operations has to be first transformed
(transported, translated, collected, interpreted) before
the actual execution can begin. The structure of
compound operations in this transformation can
also be exploited (parallel or pipelined data
transfer by I/O-devices and channels), and, under
certain conditions, even in transformation and
execution together (overlapped instruction
fetch/execution in lookahead processors). An extensive
application of this concept in the successive
interpretation of (very) high-level languages is
suggested by the current trend of hardware prices.


1. Introduction 


The exploitation of parallelism in computers
has been preceded by the recognition of common
structural features of computations, at least at
some levels. For example, the need for the
repeated transport of programs and data sets from the
periphery to main memory resulted in the use of
multiple independent I/O-channels to the main
processor, which perform this transport in parallel.
Another example is the repeated transport of
instructions from the main memory to the processor,
which resulted in the overlapping of the
instruction fetch and execution phases. There is also
quite often a need for the repeated execution of
similar operations on elements of data arrays,
which led to the construction of array and pipeline
processors.


One feels that the types of parallelism in
these three examples are somehow different, but it
is difficult to characterize this distinction
using Flynn's classification [1] according to
whether instruction streams and data streams are
processed simultaneously. It is even sometimes not
clear whether pipelining can be considered as MISD,
i.e. multiple instruction stream/single data
stream processing, and if it is so, why (Enslow
[2]). The classification scheme introduced by
Handler [3], [4], [5], which consists of the numbers
of parallel and pipelining function units
simultaneously active at the three main processing levels,
bit operation level, machine instruction level and
program level, possibly together with similar
numbers for I/O-, front-end or other coupled special
processors, gives us much more information about
the computer. This measure is quantified,
similarly to the parallelism measure introduced by
Feng [6], thus enabling one to compare different
machines according to their degree of parallelism,
and also giving us a much more detailed picture
of the machine structure, which is important for
our intentions here. One should always explicitly
state what level is considered in studies of
parallelism, etc.; speaking of the "parallelism of
the ILLIAC IV computer" or of the "serial
processing of the von Neumann computer" is of little
value.

We find it useful to consider the computa- 
tions together with the functions which they 
should implement. The user is interested only in 
the functions he wants to have computed by the 
machine, i.e. in the external behaviour of his 
program; from his point of view the machine has 
been constructed in order to execute these
functions. Regrettably, the machine has much more to
do than this execution. The user program and data 
are mostly placed in some user space, e.g. a ter- 
minal, disc or tape, but the machine can perform 
the execution only when the instructions and data 
items involved have been brought into the machine 
execution space. After the execution another 
transport is necessary in order to give the re- 
sults back to the user and to free the execution 
space for the forthcoming execution. We would like 
to speak of a transformation rather than trans- 
port or transfer, since the action can also in- 
volve some encoding of instructions or data items; 
in a broader sense also, for example, program com- 
pilation or subprogram collection belong to this 
category. Thus, although the user's only aim is 
the execution of his program upon his actual data, 
or more precisely the execution of the functions 
composing the external behaviour of his program, 
the machine has to perform both the execution and 
transformation. Sometimes also the transformation 
may be explicitly programmed by the user so that 
its description constitutes a part of his program, 
but the characteristic of the transformation is 
that it does not influence the external behaviour 
of the user program. 


Now, considering the cases of parallel com- 
putations in a machine, we can divide them ac- 
cording to the category into which the implemented 
functions belong. In the first example above the 
parallel action of multiple I/O channels forms a
part of the transformation. In the second example 
the instruction fetch belongs to the transformation
while the operation performed by the instruction
execution upon the actual data items
often forms a part of the external behaviour of
the user program. Operations performed during the 
parallel execution in an array processor belong in 
most cases to the external behaviour of the user 
program. We could speak therefore of transformation 
parallelism, transformation/execution parallelism 
and execution parallelism, respectively. We look
closer at the transformation and execution and at 
the potential inherent in the extensive exploita- 
tion of their common structure in the second part 
of the paper (sections 4,5). If the transformation 
is a transport, examples of parallel transforma- 
tion and execution at several levels in computers 
can be given: job execution overlapped with the 
transfer of the next jobs of the job stream from 
the periphery to the main memory, overlapped 
fetch/execution in lookahead processors, etc., and 
proposals for its exploitation have also been made 
across the whole storage hierarchy (cf. e.g. 
Dennis [7], Madnick [8]). The aim is to achieve the 
maximal possible execution speed with only very 
small run-time storage requirements, by having addi- 
tional processing capacity to perform the trans- 
port overlapped with the execution. The price of 
processing elements relative to memory has fallen 
rapidly. But the same principle could also be exploited
across the whole hierarchy of the successive
interpretation of (very) high-level languages.
Some proposals in this direction have been made 
e.g. by Miller and Cocke [9]. Consistent with the 
usage of the terms "compilation" and "interpretation"
in processing of high-level languages, one
could then speak of parallel compiled interpreta- 
tion. The aim here could be characterized as, in 
addition to that above, saving peripheral and mass 
storage, for their prices also remain relatively
high. Of course, the processing elements for the 
"on-the-fly" compilation must be more intelligent than
for a simple transport, but this should be no 
problem today. A sufficient condition for the 
parallel transformation/execution is that the 
transformation preserves the program ordering 
(section 5). 


However, before turning to the transformation 
and execution, we consider in more detail their 
common structural features (sections 2,3). The term 
"parallelism" is namely by no means exhaustive for 
all that can be observed in computations (nor
even in the papers presented at this conference, 
so that the words "parallel processing" in its 
name are partially misleading). We start with the 
most natural computations which are performed in 
the evaluation of compound operations, as described 
by algebraic expressions and as has been exploited 
by man for several thousand years in analog devi- 
ces, and later e.g. in combinational circuits 
(section 2). We prefer to use standard terminology 
although it has become quite fashionable in some places to
speak about "transitions", "tokens" and "firing".
Then we look at the structure of programs and 
machines. As for the notion of "structure", 
mentioned almost everywhere today, we find as its 
best explanation its usage in mathematics: "The 
fundamental structure problem of algebra is that of 
analyzing a given algebraic system into simpler 
components, from which the given system can be 
reconstructed by synthesis." (Birkhoff [10], p.55). 
We begin with the synthesis of complex programs
and machines from simple ones based on the exploitation
of similarity of their components. By
ploitation of similarity of their components. By 
"complex" we mean here and in the following "more 
intelligent". This synthesis has been mostly 
motivated by economical factors, e.g. at the time 
of the first electronic computers the number of 
memory cells and function units required for the 
execution of a computation had to be kept small. 
The synthesis can be roughly characterized as
"trading space for complexity", sometimes also as
"trading space for time". Steps in the other direction,
i.e. towards the analysis of complex
machines into simpler ones can be observed in the 
recent work in computer architecture towards dis- 
tributed processing, parallel processing (e.g. 
array and pipeline processors), or, as we prefer 
to say, structured processing (e.g. data flow 
machines and single assignment machines, cf. 
Tesler and Enea [11], Dennis [12], [7], Dennis 
and Misunas [13], Rumbaugh [14], [15], Plas et al 
[16], search mode and interconnection mode con- 
figurable computers, cf. Miller and Cocke [9], 
macropipelining, cf. Handler [17], and many other 
designs described as reconfigurable, restructu- 
rable, varistructured, variable, etc.). One could 
characterize this roughly as "trading complexity
for space", when e.g. a complex central "Alleskönner"
is replaced by a number of simpler distributed
processing elements, and sometimes also
as "trading time for space", when e.g. execution 
time is saved by the use of a greater number of 
processing elements, in accordance with the re- 
cent developments of hardware prices and the 
growing need to reduce execution time (section 3). 


2. Compound Operations and Related Computations 


The use of compound operations and their
description by expressions is widespread not only
in mathematics. Consider the very well known
description (1) of how to get the length of the
hypotenuse of a right-angled triangle, given the
lengths 3 and 4 of the sides adjacent to the right
angle.

$$c = \sqrt{a^2 + b^2}, \qquad a = 3,\ b = 4 \tag{1}$$

Perhaps a more suggestive picture is (2).

[Figure (2): tree of the expression, with the argument nodes at the leaves and the node "result c" at the root.]


Note that (1) and (2) are essentially two drawings 
of the same graph where in the first drawing some 
details such as edges, circles for nodes and the
ordering of the argument nodes are omitted for 
reasons of economy (but are implicitly present), 
and where the shape of (1) is dictated by the ty- 
pographic needs of machine print. 


In order to be able to execute the described
computation for the given arguments 3 and 4 (or,
as algebraists may prefer to say, to evaluate at
the point (3,4) the compound operation

$$R^+ \times R^+ \to R^+ : (a,b) \mapsto c \tag{3}$$

corresponding to the expression $\sqrt{a^2 + b^2}$, cf.
Grätzer [18]), we must first have learned at school
that the operators denote certain operations on
nonnegative real numbers $R^+$, i.e. we must know
what specific algebra we are dealing with, cf. [18].


For example, $\surd$ denotes the square root operation

$$R^+ \to R^+ : s \mapsto t$$

which sends a nonnegative real number s to that
nonnegative real number t for which $t^2 = s$. Given
the number 25 as the argument in the following
simple computation description

$$t = \sqrt{s}, \qquad s = 25$$

or, more explicitly, (4), the execution of the described
computation (the evaluation of the operation
square root at the point 25) means that we
determine the number 5 using our knowledge of the
operation square root and having the argument 25.
(The evaluation map

$$e : (R^+)^{R^+} \times R^+ \to R^+ : (\text{square root}, 25) \mapsto 5$$

is very important in mathematics, cf. Mac Lane [19],
p. 18, 61, 96, 216). We depict this in (5).

[Figure (4): the square root operator node with the argument 25 and the result t.]

[Figure (5): the same node after evaluation, with the argument 25 and the result 5.]

The evaluation of a compound operation des- 
cribed by an expression such as (1) or (2), is de- 
fined inductively over the height of an operator 
occurrence in the expression, i.e. the length of 
the maximal path from the leaves to the corresponding
node labelled by this operator in the tree
such as (2).


In detail: Let $f' : A^n \to A$ denote the operation
corresponding to the n-ary operator f of the
algebra under consideration. Then the value of the
compound operation corresponding to an expression
in variables $x_1, x_2, \ldots, x_k$ at the point
$(a_1, a_2, \ldots, a_k) \in A^k$ is defined by:

(i) the value at a leaf-node labelled by $x_i$ is $a_i$,
and the value at a leaf-node labelled by a 0-ary
operator $b \in A$ is b;

(ii) if an n-ary operator f is the label of a node
of height $h \geq 1$ in the tree and $b_1, b_2, \ldots, b_n$
are the values at its argument nodes, then
$f'(b_1, b_2, \ldots, b_n)$ is the value at this node.


In the above case, the induction proceeds as
shown in (6).

[Figure (6): successive induction steps of the evaluation on the tree (2).]

For illustration, in these induction steps the
values of the intermediate results are transformed
by the operations and moved along the tree from the
leaves to its root.
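
The inductive definition (i), (ii) can be sketched in a few lines of Python. This is only an illustration under assumptions that are not part of the paper: the tuple encoding of the tree, the table `OPERATIONS` and the function names `height` and `evaluate` are all hypothetical.

```python
import math

# Hypothetical encoding: a leaf is a variable name (str) or a constant
# (a 0-ary operator); an inner node is a tuple (operator, argument, ...).
OPERATIONS = {
    "pow": lambda x, y: x ** y,
    "+": lambda x, y: x + y,
    "sqrt": math.sqrt,
}

# The tree (2) for the expression sqrt(a^2 + b^2).
TREE = ("sqrt", ("+", ("pow", "a", 2), ("pow", "b", 2)))


def height(node):
    """Length of the maximal path from the leaves to this node."""
    if not isinstance(node, tuple):
        return 0
    return 1 + max(height(argument) for argument in node[1:])


def evaluate(node, point):
    """Value of the compound operation at `point`, following (i) and (ii)."""
    if isinstance(node, str):        # (i) leaf labelled by a variable
        return point[node]
    if not isinstance(node, tuple):  # (i) leaf labelled by a 0-ary operator
        return node
    operator, *arguments = node      # (ii) node of height >= 1
    values = [evaluate(argument, point) for argument in arguments]
    return OPERATIONS[operator](*values)


result = evaluate(TREE, {"a": 3, "b": 4})
```

Evaluating at the point (3,4) moves the intermediate values 9, 16 and 25 from the leaves towards the root, as the text describes.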


The above interpretation and evaluation of
(1) is as old as the usage of the expressions itself.
However, quite usual, and, in fact, also
very old, is a physical implementation of this concept.
If we have functional units for the required
operations, the function (compound operation) described
by (1), (2) can be implemented in the combinational
network (7), for example by moving and
transforming electrical signals along this tree
from the leaves to its root.

[Figure (7): combinational network with the input at the leaves and the output at the root.]


We could call the expression (1) (or the directed
graph in (2)) a program scheme or machine
scheme and the corresponding directed graph in (6),
(7) a program or machine implementing the function
(3), consistently with the common usage of these
notions (cf. e.g. Arbib and Give'on [20]).


The term “computation” is usually used for 
the sequence of intermediate results or configu- 
rations (consisting of the intermediate result and 
the state) of the corresponding program or machine 
for a given argument value (cf. e.g. [20], Elgot 
and Robinson [21], Elgot [22]), in accordance with 
the intuitive meaning of this notion. More appro- 
priate would perhaps be computation run, thus 
leaving the term computation to denote the set of 
all computation runs for all allowed argument 
values, similarly to the term "function" which can 
be interchangeably used for the set of all corresponding
ordered pairs "(argument, result)". Considering
the ordered pair consisting of the first
and the last element in the sequence of the inter- 
mediate results of each computation run, and the 
set of these pairs corresponding to the computa- 
tion (i.e. to the set of all computation runs), we 
get precisely the function implemented by the com- 
putation, sometimes called the external behaviour 
of the computation. In our case of unary and
binary operators in (2), the computation run for
the arguments (3,4) is the directed graph (8) of
the intermediate results (instead of a sequence as
in the case of only unary operators) giving the
assignment $(3,4) \mapsto 5$ as its external behaviour.
For the set $R^+ \times R^+$ of all admissible arguments we
get a set of similar graphs as the corresponding
computation, and their respective argument and
result nodes give us precisely the required function
(3) as the external behaviour of this computation
(cf. Arbib and Give'on [20]).

[Figure (8): directed graph of the intermediate results for the arguments (3,4), with the result node at the root.]


We remark that the only ordering of operator 
occurrences in (1), (2) and of the intermediate 
results in each computation run such as (8) is 
that which is induced by their argument-result 
relation, directly shown by the arrows in (2) and 
(8). Operator occurrences which are incomparable 
according to this partial order, as for example 
the two nodes labelled with the power operator 
"to" in (2), are often said to be inherently paral- 
lel. 
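
As a small illustration of this partial order, the following Python sketch lists the pairs of operator occurrences in the tree (2) that are incomparable. The tuple encoding of the tree and the helper names are assumptions made for this example only.

```python
from itertools import combinations

# Hypothetical encoding of the tree (2): (operator, argument, ...).
TREE = ("sqrt", ("+", ("pow", "a", 2), ("pow", "b", 2)))


def operator_occurrences(node, path=()):
    """Yield (path from the root, operator) for every operator occurrence."""
    if isinstance(node, tuple):
        yield path, node[0]
        for index, argument in enumerate(node[1:]):
            yield from operator_occurrences(argument, path + (index,))


def comparable(p, q):
    """Two occurrences are ordered iff one lies below the other in the tree,
    i.e. iff one root path is a prefix of the other."""
    shorter, longer = sorted((p, q), key=len)
    return longer[:len(shorter)] == shorter


occurrences = list(operator_occurrences(TREE))
inherently_parallel = [
    (op1, op2)
    for (p1, op1), (p2, op2) in combinations(occurrences, 2)
    if not comparable(p1, p2)
]
```

For the tree (2) the only incomparable pair consists of the two occurrences of the power operator.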


It was our intention to use standard termino- 
logy for well-known phenomena such as expressions, 
their interpretation and evaluation. This is not
only convenient but, moreover, it enables one to 
exploit results already known (cf. Give'on and 
Arbib [23] for a study of a structure of the com- 
pound operations described by a given operator 
set). Some authors prefer to use notions like 
"tokens which contain values", "an actor with a
token on each of its input arcs" which is "enab- 
led and sometime later will fire" and quite often 
they speak of "data driven execution" in a simi- 
lar context. 


We note that the program (machine) scheme 
and the corresponding program (machine) have si- 
milar graph structures (cf. (2) and (7): the 
underlying directed graphs are indeed isomorphic). 
Because these graph structures are our main con- 
cern in the following, we shall feel free to use 
the notions scheme, program and machine inter- 
changeably, as convenient, hopefully without 
causing any confusion. 


3. Structure of Computations and Machines 


The notions computation run and computation, 
used in the last section, can be considered as 
the prior and most natural concepts from which 
the notions of "operation", "operation composition",
"compound operation" and the corresponding
"operator" and "expression" are obtained by syn- 
thesis. Concepts such as "interpretation" and 
"implementation" represent steps in the other 
direction, i.e. analysis. To this extent we can 
speak of the structure of computations which is 
reflected in the most simple programs and machi- 
nes such as (2) and (7) of section 2. However to 
explain this in more detail is outside the scope 
of this paper so that we study only the struc- 
ture of programs and machines in the following, 
starting with the simple machines of the last
section. 

We depict in (9) once more the program
(machine) implementing the compound operation
$R^+ \times R^+ \to R^+$ corresponding to the expression
$\sqrt{a^2 + b^2}$. The situation is now more symmetric
with respect to data items and operators, since
operators can also be treated as data, e.g. they
can be changed. The only operation acting upon all
the data items and operators is the evaluation of
section 2, which we left anonymous. We call the
nodes such as those labelled a, b, 2 or without label
in (9) data nodes and the nodes labelled with
operators operator nodes. Every programmer would
probably be inclined to speak instead of the data
locations or variables and the (operator code part
of) instruction locations, but our data nodes can
actually be memory data locations as well as general
purpose registers or data lines of a bus, and
our operator node can be the operator code part
of the location in memory of an instruction, as
well as a function unit implementing the corresponding
operation or a control signal line of a
bus.

[Figure (9): the program (machine) graph with data nodes and operator nodes, the first and second argument at the leaves and the result at the root.]


The implementation (description) of computations
by the simplest machines (programs) such
as (7), (9) is not always economically feasible.
For large computations this would require too many
data and operator nodes. In the case of machines
this means that the number of registers and function
units is too big; for programs their size is
too large, and the schemes (expressions such as
(1)) become clumsy. However, looking at these
simple machines (programs, schemes) we observe the
similarity of certain parts: the same data items
or operators occur repeatedly at different nodes,
sometimes even rather large identical, or at least
very similar parts of the machine occur repeatedly.
In what follows we describe several quite common
cases of synthesis, where such similar parts of
machines (programs, schemes) are "coalesced". This
happens at the cost of simplicity, since some new
mechanisms such as control flow or subroutine call
must then be introduced into the machines (programs,
schemes). Sometimes the saving of nodes is
outweighed by the introduction of a new dimension
into the machines (programs, schemes), the time.
For each case of synthesis we show also examples
of analysis of the complex machines into a greater
number of simple ones, motivated by the recent
development of hardware prices and by the increasing
need to reduce execution time. The external
behaviour of the corresponding computations remains
in all cases unchanged, i.e. both the synthesis
and analysis are "semantics preserving".


Multiple use of data nodes. Looking at the 
example of the machine (9) we see that the two 
occurrences of the data item "2" could be coalesced,
thus obtaining the machine (10). The underlying
directed graph is then no longer a tree (cf.
Arbib and Give'on [20]). This technique is quite
common in programming. Such a saving of data nodes
costs increased complexity of the machine: the
data item "2" must be available for two references
to it; increased execution time may also result
if, owing to technical circumstances, one such
multiple reference must wait for the completion
of another.


Analysis example: the replication (broad- 
casting) of an argument with multiple references 
used in the packet communication architecture 
(cf. Dennis [7]).

Multiple use of operator nodes. The two nodes in
(10) bearing the label "pow." can also be coalesced,
cf. (11). A typical example is the use of
function units in a centralized processor: there
will be only one function unit for the operation
"pow." which will satisfy all references to it.
Compared to the function units of (9), this happens
at the cost of increased complexity of the single
function unit in (11) which must be able to resolve
multiple references; increased execution time
may also result if one such reference must wait
for the completion of another. A second example is
the use of array operators in programming languages.

[Figure (11): the graph (10) with a single operator node "pow."; result at the root.]


Analysis example: the provision of multiple 
function units for more frequent operations, e.g. 
two increment units in CD 6600. Another example is 
the provision of multiple function units for ope- 
rations on data arrays in array processors, or the 
replication (broadcasting) of an array operator 
when executed on an array processor. 


Re-use of data nodes. In the labelled graph
(11) we have used explicitly the letters p, r, s,
t, u, v, w, x, y, z for the nodes. Another possible
representation of this graph is (12) which shows
in another form the (finite) maps of the labelling
and assignment of nodes to edges depicted in (11).

$$p=2,\ r=\mathrm{pow.},\ s=a,\ t=b,\ u=r(s,p),\ v=r(t,p),\ w=+,\ x=w(u,v),\ y=\surd,\ z=y(x) \tag{12}$$

Now, observing the chain $s \to u \to x \to z$ of data nodes in
(11) with the property that none of them has other
immediate successors, we can use four copies
(1,s), (2,s), (3,s), (4,s) of the same node s,
with the chain ordering $1 \to 2 \to 3 \to 4$, instead of the
chain $s \to u \to x \to z$. (An algebraist would say we take
the direct product of {s} with the chain $1 \to 2 \to 3 \to 4$
together with the product order relation.) We get
the description (13).

$$p=2,\ r=\mathrm{pow.},\ (1,s)=a,\ t=b,\ (2,s)=r((1,s),p),\ v=r(t,p),\ w=+,\ (3,s)=w((2,s),v),\ y=\surd,\ (4,s)=y((3,s)) \tag{13}$$

If we call the chain $1 \to 2 \to 3 \to 4$ a "time sequence",
we can say that the three data nodes u, x, z have been
saved by re-using the data node s at three other time
instants. The price to be paid is increased complexity of
the machine, because we need a clock giving the
time impulses $1 \to 2 \to 3 \to 4$; increased execution time
may also result if the clock is slower than the
gate times and conducting delays along the path
$s \to u \to x \to z$. Note that the program (machine) (13) cannot
be depicted as a graph like (11) without introducing
some new description conventions. Nor
can it be implemented as a combinational network,
since the clock functions as a delay unit between
two successive uses of the data node s; we
get a sequential network (cf. for example Hennie
[24], chapter 1, for a discussion of combinational
network/sequential network dichotomy in finite
automata implementation).


We can save yet more data nodes by extended
re-use of data nodes (14). Three data nodes,
p, s and t, are sufficient instead of the seven
data nodes in (11). However, the price to be paid
is increased execution time by additional ordering
of operators which were inherently parallel in (11).

   1. p=2;       2. r=pow.;    3. s=a;      4. t=b;
   5. s=r(s,p);  6. p=r(t,p);  7. w=+;                (14)
   8. s=w(s,p);  9. y=√;       10. s=y(s).

We use in (14) the convention that the value of
the new ordering parameter $1 \to 2 \to \cdots \to 10$ is
written in front of a node on the left-hand side
of "=" and the implicit assumption that an occurrence
of a node on the right-hand side of "="
should be indexed with the last previously occurring
index at this node; e.g. the full description
of the fifth assignment would be
(5,s) = (2,r)((3,s),(1,p)).
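
The effect of re-using data nodes can be checked with a short Python sketch that runs the descriptions (12) and (14) side by side. It is an illustration only: the dictionary used as the set of data nodes and the two function names are assumptions of this example, and the operators are kept in ordinary local variables.

```python
import math
import operator


def run_single_assignment(a, b):
    """Description (12): every data node is assigned exactly once."""
    data = {}
    data["p"] = 2
    r = pow
    data["s"] = a
    data["t"] = b
    data["u"] = r(data["s"], data["p"])
    data["v"] = r(data["t"], data["p"])
    w = operator.add
    data["x"] = w(data["u"], data["v"])
    y = math.sqrt
    data["z"] = y(data["x"])
    return data["z"], len(data)


def run_with_reuse(a, b):
    """Description (14): the data nodes p, s and t are re-used in time."""
    data = {}
    data["p"] = 2                         # 1.
    r = pow                               # 2.
    data["s"] = a                         # 3.
    data["t"] = b                         # 4.
    data["s"] = r(data["s"], data["p"])   # 5.
    data["p"] = r(data["t"], data["p"])   # 6.
    w = operator.add                      # 7.
    data["s"] = w(data["s"], data["p"])   # 8.
    y = math.sqrt                         # 9.
    data["s"] = y(data["s"])              # 10.
    return data["s"], len(data)


value_12, nodes_12 = run_single_assignment(3, 4)
value_14, nodes_14 = run_with_reuse(3, 4)
```

Both runs give the same external behaviour for the arguments (3,4), while the number of data nodes drops from seven to three; the assignments 5 and 6, which were independent in (12), are now executed one after the other.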


Analysis example: the iterative array implementation
of sequential circuits (cf. Hennie [24]).
This is to some extent the principle employed in
pipeline processors, data flow machines (cf. [7],
[12], [13], [14]), single assignment machines
(cf. [11], [16]), interconnection mode configurable
machines (cf. [9]) and macro-pipelining (cf.
[17]).

Re-use of operator nodes. Because of the
symmetry of the programs (machines) with regard 
to the data and operator nodes, the same reason- 
ing holds here as above. Typical examples are re- 
definable operators in some interpretive pro- 
gramming languages, e.g. Snobol, or micropro- 
grammable function units in some computers. 


Analysis example: the replacement of the all- 
purpose ALU of the CD 6400 by the dedicated func- 
tion units in the CD 6600. 


Shared use of description parts (branching).
Let the two programs (machines) (15a) contain
identical parts. Then we can
join them to the single program (machine) (15b),
thus saving data and operator nodes. The price
to be paid is increased complexity, viz. the introduction
of a new mechanism, the control flow,
into the program (machine).

[Figure (15): a) the two programs (machines); b) the joined program (machine).]

The operators f, $f_1$, g, h in (15a) have then to
be extended to f', $f_1'$, g', h' in which a new
control flow parameter is taken into account.
Examples can be found in ordinary programming.


Consider another case of the two programs
(16a) where the domains as well as the value sets
of both pairs of the operations g, $g_1$ and h, $h_1$
are disjoint. Then we can use the single program
(16b) whose operations are the unions of the above
pairs, the union of h and $h_1$ additionally producing
a truth value for the control flow branch (cf. Elgot [25]).

[Figure (16): a) the two programs; b) the single program.]


Analysis example: the step from centralized 
to distributed control, e.g. from centralized 
"star" or bus interconnection to decentralized 
full interconnection of computer modules (cf. 
Anderson and Jensen [26]). 


Re-use of program parts (subroutine calls).
Let a part consisting of the same operations
occur repeatedly in the program (17a).

[Figure (17): a) program with a repeatedly occurring part; b) program using multiple copies of that part.]

Then we can save data and operator nodes if we
instead use multiple copies of the repeatedly
occurring part (see (17b)), distinguished
by a new parameter i (i=1,2,3). Again, calling
$1 \to 2 \to 3$ "time" we can say that we use the same
program part at three different time instants. In
detail: if the repeated occurrences are

[Figure (18): chain of the data nodes $x_i$, $y_i$, $z_i$ connected through the operator nodes $u_i$, $v_i$ (i=1,2,3).]

with data nodes $x_i$, $y_i$, $z_i$ and operator nodes
$u_i$, $v_i$, then the use of (19) instead of (18)

[Figure (19): the chain of (18) paired with the parameter i (i=1,2,3).]

represents normal subroutine calls while (20)
represents re-entrant subroutine calls. The price to
be paid is increased complexity, viz. the necessary
introduction of a new mechanism into the program,
the subroutine call. Increased execution
time may also result if one of the multiple calls
of the same subroutine must wait for the completion
of another.

Analysis example: some compilers generate
"inline code" for each call of an intrinsic or
library function subroutine, i.e. they generate 
full object code of the subroutine at each place 
corresponding to a subroutine call in the source 
program. 


4. Transformation and Execution of Programs 


In the last section we considered generally 
the structure of computations. In this section we 
want to distinguish between the computation
implementing the external behaviour of the user
program on his data, which we call simply execution
in the remainder of the paper, and the computation
performing the transformation of the
user program and data, as outlined in section 1.


There exist different methods to perform the 
transformation necessary for the execution of a 
program upon its actual data. 


Global program/global data transformation:
First the whole program and all initial data are
transformed, then executed, and then the final results
are transformed back. The following are some
significant features: Positive: (a) Because the
whole transformation is performed in one piece,
before and after the execution respectively, it
is possible to analyze the transformation, as described
in section 3, to the extent this is economically
feasible (e.g. several I/O channels which
can transport simultaneously several data files
needed for one program). (b) Because the whole
execution is performed in one piece, it is possible
to analyze the execution, as described in
section 3, to the extent this is economically
feasible (e.g. array processors, associative processors,
arithmetic pipelining). Negative: (c)
Large storage capacity is required for program and
data in the (expensive) execution space (e.g. main
memory required to store a large compiled program
and the large data arrays to be processed by this
program). (d) Too much transformation is performed
on programs where only a small part of all
instructions (operators) and data items actually
occur during execution (e.g. loading of a segment
of a segmented executable program when it is activated,
where the whole program segment is transported
into the main memory although possibly only
a very small part of the segment will actually be
executed; or a paging machine where a whole page
of data is transported although possibly only a
few data items of the page will actually be processed).


Local program/local data transformation: for
a single user program statement that has actually
received control: first the instruction (operator)
and data items which are the arguments are transformed,
then the corresponding function is evaluated,
and finally the results are transformed back.
Significant features: Positive: (a) Small storage
capacity is required for program and data in the
execution space (e.g. a simple machine with a few
registers into which the function code and arguments
are loaded for evaluation). Negative: (b)
Because transformation and execution are interleaved
in small slices, they both can be analyzed
to only a very small extent. (c) For the same
reason as (b), the transformation delays the execution
(e.g. a high-level language interpreter).
(d) Too much transformation is performed on programs
in which many of the instructions (operators)
and data items occur repeatedly during execution
(e.g. a high-level language program performing
a large number of iterative computing
steps, which is interpreted by a language interpreter).
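
The two drawbacks labelled (d) can be made concrete with a deliberately crude count of transformed items. The sketch below is not a model taken from the paper: it assumes that every item costs one transformation, ignores the transformation of results back to the user space, and uses hypothetical names throughout.

```python
def global_transformations(program_items, trace):
    """Global method: the whole program is transformed once, used or not."""
    return len(program_items)


def local_transformations(program_items, trace):
    """Local method: an item is transformed each time it receives control."""
    return len(trace)


# A large program of which only a small part is actually executed.
sparse_program = [f"stmt{i}" for i in range(1000)]
sparse_trace = sparse_program[:10]

# A small loop body executed many times.
loop_program = [f"stmt{i}" for i in range(10)]
loop_trace = loop_program * 1000

sparse_counts = (
    global_transformations(sparse_program, sparse_trace),
    local_transformations(sparse_program, sparse_trace),
)
loop_counts = (
    global_transformations(loop_program, loop_trace),
    local_transformations(loop_program, loop_trace),
)
```

Under these assumptions the global method does far more work than necessary on the sparsely executed program, and the local method does so on the loop.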


There are also intermediate or mixed methods 
of performing transformations which try to ex- 
ploit some advantages and avoid some drawbacks of 
the two extreme methods given above. 


Global program/local data transformation: the 
whole program is transformed before and after the 
whole execution, while data items are transformed 
only when required for the current execution, and 
then transformed back again (e.g. an executable 
program processing direct access disc files). 


Combined global/local transformation according
to the assumed number of occurrences of instructions
(operators) and data items during the
execution: the most frequent are transformed globally
into the execution space before and after
the whole execution, while the remaining are transformed
locally only when required for the current
execution (e.g. parts of an operating system or a
hierarchy of subprogram libraries in general purpose
applications). A modification of this method
is the dynamic global/local transformation: instructions
(operators) or data making up the globally
transformed items in the execution space will
not stay there for the whole execution but may be
exchanged for locally transformed items, e.g. if
they have not been involved in execution for a
long time (e.g. usage of general purpose registers
of a processor by executable programs which load
them with some data more frequently required for
execution and replace them later by others; another
example is the cache memory, or the throw-away
compiling, cf. Brown [27]).


Blockwise transformation: Program and data
are partitioned into blocks; any of these blocks
is transformed into the execution space whenever
an item it contains is required for execution, and
it is transformed back when an item outside of this
block is required (e.g. segmented loading of programs,
paging machines).


These intermediate methods depend heavily on 
the specification of the portions of programs or 
data which are to be transformed locally and glo- 
bally, respectively. If badly specified, they can 
result in much worse overall performance than in 
the first two methods (e.g. columnwise processing 
of a large matrix on a rowwise paging machine). 
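
The matrix example can be simulated in a few lines. The following Python sketch counts how many pages are brought into the execution space when a matrix stored row by row is traversed row-wise and column-wise. The page size of one matrix row, the four page frames and the least-recently-used replacement are assumptions chosen for the illustration, not properties of any particular machine.

```python
from collections import OrderedDict


def page_transports(addresses, page_size, frames):
    """Count page loads into the execution space under LRU replacement."""
    resident = OrderedDict()
    transports = 0
    for address in addresses:
        page = address // page_size
        if page in resident:
            resident.move_to_end(page)        # page already in execution space
        else:
            transports += 1                   # block must be transformed
            if len(resident) >= frames:
                resident.popitem(last=False)  # evict least recently used page
            resident[page] = True
    return transports


N = 64  # N x N matrix stored row by row, one row per page
row_wise = [i * N + j for i in range(N) for j in range(N)]
column_wise = [i * N + j for j in range(N) for i in range(N)]

row_transports = page_transports(row_wise, page_size=N, frames=4)
column_transports = page_transports(column_wise, page_size=N, frames=4)
```

With these parameters the row-wise traversal transports each page once, whereas the column-wise traversal transports a whole page for every single item processed, which is worse than transforming the items one by one.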


5. Common Analysis of the Transformation and Execution


The global program/global data transformation 
is an extreme case which allows very fast execu- 
tion but requires very much storage in the exe- 
cution space. If the storage capacity does not 
suffice, one has to use some of the intermediate 
methods. In these, however, the interleaved trans- 
formation delays the execution, as is seen most 
clearly in the opposite extreme case of the local 
program/local data transformation. If the machine 
were able to transform locally the items required 
for the next execution during the current exe- 
cution, and simultaneously to transform back the 
items which have been involved in the foregoing 
execution, it could achieve an execution speed 
comparable to execution after global transforma- 
tion. This would mean analyzing the transformation 
and the execution together, as described in sec- 
tion 3. However, for the transformation of the 
items which will be involved in the next execution 
the machine must first decide which these will be.
It is not always possible to give a precise answer.
The following method offers a rather ad hoc but
easily implemented solution:


Neighbourhood: One expects that the items in 
some neighbourhood of those involved in the cur- 
rent execution will be required for the next exe- 
cution and transforms them (cf. blockwise trans- 
formation, section 4) into the execution space, 
simultaneously with the current execution (e.g.
hierarchical storage reorganisation, Madnick [8]). 
Thus very fast execution may result, but in the 
least favourable case the execution speed can be 
worse than in the local transformation method, 
while much more storage is required in the execu- 
tion space. 


However, there is a lot of information in the
user program about the possible next items. Let X
be the set of user's variables, L the set of user's
labels, and assume that

$l: f: x \to (y;\ l_0, \dots, l_{p-1})$

is a part of his program, with labels $l$ and
$l_0, \dots, l_{p-1} \in L$, where $x$, $y$ are sequences
$(x_1, \dots, x_m)$, $(y_1, \dots, y_n)$ of variables of X.
The user's reference manual interpretation, which we
denote by I, then assigns to f some (partial) function

$If: A_{t_1} \times \dots \times A_{t_m} \to (A_{u_1} \times \dots \times A_{u_n}) \times p$

as its meaning, where p denotes the set {0,1,2,...,p-1}.
The letters $t_1, \dots, t_m, u_1, \dots, u_n$ denote some
elements of the set T of the allowed data types and
$A_{t_1}, \dots, A_{u_n}$ the underlying sets, i.e. the sets
of possible values on which the function If operates.
Thus If assigns to arguments $a_1, \dots, a_m$ of the
required types in the domain of If some $b_1, \dots, b_n$
of specified types as its results, and a truth value j
as one of its possible p outcomes. In accordance with
the reference manual interpretation I, the instruction
labelled with $l$ would be decoded as follows: If
$a_1, \dots, a_m$ are the respective values of
$x_1, \dots, x_m$ and if the value of If at
$(a_1, \dots, a_m)$ is $(b_1, \dots, b_n;\ j)$, then assign to
the variables $y_1, \dots, y_n$ the values $b_1, \dots, b_n$,
resp., and for the next action refer to the label $l_j$.


Thus each program statement specifies all its
possible direct successors. If the transformation
preserves this partial ordering, the possible direct
successors of the currently executed program
statement of the transformed program can also be
specified.


Without going into further details, we call
the transformation order preserving, if it consists
of a map F of programs and a map γ of data
with the following properties: F maps the data
types, $T \to T' : t \mapsto t'$, into the data types of the
transformed programs and γ maps bijectively the
data items, $A_t \to A'_{t'}$ for each $t \in T$, into the
transformed data items; F sends each function
name (instruction, operator) f of a type
$(t_1 \dots t_m, u_1 \dots u_n, p)$ into a function description
(program) f' of the type $(t'_1 \dots t'_m, u'_1 \dots u'_n, p)$ which
is to be interpreted by the interpretation I' over
the sets $A'_{t'}$ ($t' \in T'$) of the transformed programs
so that always $(\gamma^n \times 1_p) \circ If = I'f' \circ \gamma^m$; finally
we require that F map for each user program P
injectively the user variables, $X \to X' : x \mapsto x'$, and
labels, $L \to L' : l \mapsto l'$, into the variables and labels
of the transformed program P', respectively, and
send each statement $l: f: x \to (y;\ l_0, \dots, l_{p-1})$ of
P into $l': f': x' \to (y';\ l'_0, \dots, l'_{p-1})$ in P', with
$x' = (x'_1, \dots, x'_m)$ if $x = (x_1, \dots, x_m)$.

These conditions on (F, γ) ensure that each user's program
statement and data item can be transformed independently
of other items and that the transformed program
will process under the interpretation I' the
transformed data as the user expects, considering
his source program, data, and interpretation I
only. In order preserving transformations one can
apply lookahead methods for the transformation and
execution.


Partial lookahead: Some of the possible direct
successor user program statements are chosen and
the corresponding items are transformed during the 
current execution. If the current execution has 
another outcome than expected, execution is delay- 
ed and the actually required items have to be 
transformed first (e.g. lookahead processors). The 
next method seems to be the most promising. 


Total lookahead: For all possible direct successor
user program lines the corresponding items
are transformed simultaneously with the current 
execution (e.g. a user sitting at a demand termi- 
nal and waiting a long time for the outcome of the 
currently executed job control command, who al- 
ready pretypes onto the screen the job control 
line for each of the possible outcomes). Of course, 
if the transformation is so complicated that it 
lasts longer than the execution, then more distant
possible successors (successors of successors) must
also be taken into account.
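
The total lookahead idea can be illustrated with a small sequential simulation. This is only a sketch under stated assumptions: the program model (a dictionary from labels to an operation and its successor labels), the function `run_with_total_lookahead`, and the `transform` callback are hypothetical names introduced here, and the "simultaneous" transformation is merely simulated by doing it before the outcome of the current statement is known.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical program model: label -> (operation, list of successor labels).
# An operation mutates the environment and returns the index j of its outcome.
Operation = Callable[[dict], int]
Program = Dict[str, Tuple[Operation, List[str]]]


def run_with_total_lookahead(
    program: Program,
    start: str,
    env: dict,
    transform: Callable[[Operation], Operation],
) -> List[str]:
    """Execute `program`, transforming all direct successors ahead of time."""
    ready = {start: transform(program[start][0])}  # items in the execution space
    trace: List[str] = []
    label: Optional[str] = start
    while label is not None:
        _, successors = program[label]
        # Stand-in for work done "simultaneously" with the current execution:
        # every possible direct successor is transformed before the outcome is known.
        prefetched = {s: transform(program[s][0]) for s in successors}
        outcome = ready[label](env)
        trace.append(label)
        label = successors[outcome] if successors else None
        ready = prefetched
    return trace


def dec_and_test(env: dict) -> int:
    env["n"] -= 1
    return 0 if env["n"] > 0 else 1  # outcome 0: repeat, outcome 1: leave the loop


def halt(env: dict) -> int:
    return 0


calls: List[str] = []


def counting_transform(op: Operation) -> Operation:
    calls.append(op.__name__)  # record every transformation that is performed
    return op                  # identity "transformation" for the demo


PROGRAM: Program = {"loop": (dec_and_test, ["loop", "done"]), "done": (halt, [])}
env = {"n": 3}
trace = run_with_total_lookahead(PROGRAM, "loop", env, counting_transform)
```

In this model the price of total lookahead is visible in `calls`: both successors of `loop` are transformed on every pass, although only one of them is executed next.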


From the above conditions on the transforma- 
tion it follows that the transformation preserves 
the "shape" of the program and data (cf. Goguen 
[28]), as in the case of their transport (trans- 
fer), or "refines" the program instructions and 
operators as subprograms, and data items as data 
constructs in a lower-level language, which we call
a successive interpretation. If we do not insist
that the transformation transforms instructions
and operators f of the source program "pointwise",
but allow f to be a source program subroutine,
then the transformation also involves, for example,
the analysis of a complex program into its "unfolded"
version, as described in section 3 (cf.
e.g. Ramamoorthy and Gonzalez [29] for a survey
of some techniques).


6. Conclusion 


We have explored the common structure of the 
transformation and execution of programs. The 
notion of transformation is general enough to 
include not only transport but also high-level 
language translation, subroutine calls, emulation, 
hardware implementation of functions, etc. Most
of machine data processing consists of a hierarchy
of successive transformations where lookahead methods
can be applied, if these transformations
preserve the ordering of the instructions and
operators in a source program. Our aim was to attract
more attention to the potential inherent in
the successive interpretation of (very) high-level
languages and the possibility of the exploitation
of lookahead methods for the compiled interpretation,
in accordance with the recent developments
of hardware prices.
Berlin, Germany


Abstract -- The paper summarizes previous solutions
and presents new ones for the problem of constructing
schedules for a set of independent tasks
to be executed on several processors. For each
task the requestline and deadline for execution and
the computation time required on any processor are
known in advance.


1. Introduction 


A general monitor system, GMS, consists of a 
finite set of independent tasks, T, each task 
having an individual requestline, RL(T), an indi- 
vidual "hard" deadline, DL(T), and an individual 
computation time, CT(T), where 


$0 < CT(T) \le DL(T) - RL(T)$ is assumed.


The problem is to construct schedules for a 
GMS and a minimal number of identical, independ- 
ent processors of finite speed. If only one pro- 
cessor is available the general solution for a 
GMS is derived in [S1]. Restricted monitor systems
were investigated previously by Liu/Layland
[LL] and by one of the authors [S2], the latter 
considering the case of several processors. An- 
other special case was studied by Labetoulle [La], 
assuming a single processor system. 


The first result of this paper is 

- a method for calculating the minimal num- 
ber of processors required for executing a given 
GMS without violating constraints; 

- a scheduling scheme which describes the 
class of all preemptive schedules for a given GMS 
and a given number of processors. 


We are interested in "classes of schedules"
in order to be able to take into account additional
constraints imposed on solutions by reality (see
[S1] for further explanations).


Basically a scheduling scheme consists of two
algorithms (see figure 2):

- the first algorithm computes the set of all
"admissible" assignments of the GMS;

- the second algorithm computes the maximal
running time for the admissible assignment from
this set selected for execution.


The second result of this paper is a scheduling
algorithm for a GMS having some relatively weak
additional property. [H] is the full version of
this paper including all proofs of correctness of
the algorithms derived.


2. Notions and definitions 


A general monitor system, GMS := (T,RL,DL,CT),
is defined to be a finite set T of tasks and three
mappings for the requestlines, deadlines and computation
times

$RL : T \to R$
$DL : T \to R$
$CT : T \to R$


The computation time, CT(T), of a task T gives the 
time required to execute T completely on any of 
the available processors. The processing of a 

task T cannot begin before its requestline, RL(T), 
and must be completed before its deadline, DL(T). 
Because of the graphical representation of the 

GMS chosen in this paper it is sometimes reason- 
able to speak about lengths of tasks instead of 
computation times of these tasks (see figure 1). 
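
As a minimal sketch of how these definitions might be held in a program, the following represents a task by its requestline, deadline and computation time and checks the assumption 0 < CT(T) ≤ DL(T) - RL(T) from the introduction. The class name `Task`, its field names, the helper `slack` and the sample values are illustrative assumptions, not part of the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    rl: float  # requestline RL(T): processing cannot begin earlier
    dl: float  # deadline DL(T): processing must be completed by then
    ct: float  # computation time CT(T) on any of the processors

    def is_well_formed(self) -> bool:
        # The paper assumes 0 < CT(T) <= DL(T) - RL(T).
        return 0 < self.ct <= self.dl - self.rl

    def slack(self) -> float:
        # Time the task may stay unprocessed inside its window
        # without missing its deadline (a helper introduced here).
        return (self.dl - self.rl) - self.ct


# Illustrative values only.
GMS = [Task("T1", 0, 1, 1), Task("T2", 0, 3, 2.5), Task("T3", 1, 4, 2.5)]
```

A task with zero slack, such as `T1` above, has to be processed without interruption from its requestline to its deadline.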


A processor system consists of a finite set 
of independent and identical processors which are 
able to process the tasks with a constant, posi- 
tive, and finite speed. The units of length and 
time are determined such that one processor re- 
duces the length of a task by one in one time 
unit. We exclude that several processors simultaneously
execute one task or that one processor
executes several tasks simultaneously.

X ⊆ T is called an assignment if all tasks of X
are processed simultaneously for some time, the
running time t^X of the assignment X. We say a
preemption occurs if a processor executing a task 
is interrupted before the length of the task has 
been reduced to zero. 


GMS_t := (T_t,RL_t,DL_t,CT_t) denotes the remaining GMS
after a finite sequence of assignments with total
running time t has been executed.


A schedule S for a given GMS and for a given number
of processors is determined by a finite sequence
of assignments X and the respective running
times t^X, compatible with the requirements of the
GMS.


A schedule for a GMS and M processors is called
optimal iff there is no other schedule for this
GMS and fewer than M processors.


Figure 1: Representation of a GMS; processing
takes place from right to left. The
requestlines, deadlines and computation
times are the following:

RL(T_1) = 0, DL(T_1) = ..., CT(T_1) = ...
RL(T_2) = 0, DL(T_2) = 2, CT(T_2) = ...
RL(T_3) = 3, DL(T_3) = 4, CT(T_3) = ...
RL(T_4) = 1, DL(T_4) = 2, CT(T_4) = ...
RL(T_5) = ..., DL(T_5) = ..., CT(T_5) = ...
RL(T_6) = ..., DL(T_6) = ..., CT(T_6) = ...

Note that Y = (T_5,T_6) is not contained
in the set of admissible assignments
at time t=0, because if executed for
an arbitrarily small time, t=ε, the
GMS_ε is over-critical in the k-interval
[1,2] ∪ [3,4]. Thus T_1 and T_2 must
be assigned at first.


The following definitions are required for 
the description of the algorithms. 


A single interval begins at some requestline and 
terminates at some deadline of some task(s). A 
multiple interval (or k-interval, k € N) is the 
union of a finite number of single intervals. The 
minimal load, MINLOAD (GMS,A), of a k-interval A 
of a GMS is given by the sum over those parts of 
tasks of the GMS which at least must be processed 
inside of A, because they cannot be executed 
outside of A without violating constraints defined
by the GMS (see figure 1).


For a given number of processors, M, a k-interval
A of a GMS is called critical iff the condition

$\mathrm{MINLOAD}(GMS,A) = M \cdot \mathrm{length}(A)$

holds. It is called over-critical iff the condition

$\mathrm{MINLOAD}(GMS,A) > M \cdot \mathrm{length}(A)$

holds.
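
The following sketch shows one way to evaluate MINLOAD and the two conditions above. It rests on an assumed reading of the definition: a task with window [RL, DL] can spend at most the part of its window lying outside A on processing outside A, so at least CT minus that amount must be processed inside A. The function names, the label "uncritical" for the remaining case, and the sample task values are introduced here for illustration.

```python
from typing import List, Tuple

Interval = Tuple[float, float]
Task = Tuple[float, float, float]  # (RL, DL, CT)


def merge(intervals: List[Interval]) -> List[Interval]:
    """Return the union of the given intervals as sorted, disjoint pieces."""
    merged: List[Interval] = []
    for a, b in sorted(intervals):
        if merged and a <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], b))
        else:
            merged.append((a, b))
    return merged


def length(k_interval: List[Interval]) -> float:
    return sum(b - a for a, b in merge(k_interval))


def minload(tasks: List[Task], k_interval: List[Interval]) -> float:
    """Sum of the parts of tasks that cannot be processed outside the k-interval."""
    pieces = merge(k_interval)
    total = 0.0
    for rl, dl, ct in tasks:
        inside = sum(max(0.0, min(b, dl) - max(a, rl)) for a, b in pieces)
        outside = (dl - rl) - inside  # window time available outside A
        total += max(0.0, ct - outside)
    return total


def classify(tasks: List[Task], k_interval: List[Interval], m: int,
             eps: float = 1e-9) -> str:
    load, capacity = minload(tasks, k_interval), m * length(k_interval)
    if load > capacity + eps:
        return "over-critical"
    if abs(load - capacity) <= eps:
        return "critical"
    return "uncritical"  # label chosen here; the paper names only the other two


# Illustrative task set (RL, DL, CT).
TASKS = [(0, 1, 1), (0, 3, 2.5), (1, 4, 2.5), (3, 4, 1), (0, 4, 1)]
```

With these values the whole interval [0,4] is critical for two processors, while [1,3] would be over-critical if only one processor were available.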


The length of a GMS, L(GMS), is defined to be the
length of the interval between its first requestline
and its last deadline. Then a GMS is called
adjusted (with respect to M processors) iff the
condition

$\sum_{T \in \mathbf{T}} CT(T) = M \cdot L(GMS)$

is fulfilled and none of its k-intervals is over-critical.
Obviously the first property can be obtained
in a trivial way. For an adjusted GMS an
assignment is called admissible iff executing it
for an arbitrarily small time t, t > 0, implies that
GMS_t is adjusted. The longest running time of an
assignment, X, with this last property is called
maximal running time, t^X_max.
Note: Obviously an adjusted GMS contains at least
M requested tasks.


The basic idea of this paper is the following: 

- In order to determine M, the minimal number
of processors for executing a given GMS completely,
consider the minimal load density, defined
by MINLOAD(GMS,A)/length(A) for all k-intervals,
A, and calculate the maximum. This maximum, rounded
up to the next integer, is M.

- In order to construct an optimal schedule
for the GMS and M processors control the minimal
load density of all k-intervals of the remaining
GMS_t such that none of them exceeds this bound M,
by choosing appropriate assignments and running
times.

Any scheduling algorithm obeying this principle
generates optimal schedules. Moreover, all optimal
schedules can be described in this way.
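
The first part of this idea can be sketched as a brute-force computation. The code below enumerates every k-interval as a union of single intervals (each running from some requestline to some later deadline), takes the maximal load density and rounds it up. It is an exponential enumeration meant only for tiny examples; the helper names and the sample task set are assumptions made here, and `Fraction` is used so that the rounding is exact.

```python
from fractions import Fraction
from itertools import combinations
from math import ceil


def merge(intervals):
    """Union of intervals as sorted, disjoint pieces."""
    merged = []
    for a, b in sorted(intervals):
        if merged and a <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], b))
        else:
            merged.append((a, b))
    return merged


def density(tasks, k_interval):
    """MINLOAD(GMS, A) / length(A) for the k-interval A."""
    pieces = merge(k_interval)
    load = Fraction(0)
    for rl, dl, ct in tasks:
        inside = sum(max(0, min(b, dl) - max(a, rl)) for a, b in pieces)
        load += max(0, ct - ((dl - rl) - inside))
    return load / Fraction(sum(b - a for a, b in pieces))


def k_intervals(tasks):
    """All unions of single intervals [requestline, deadline]."""
    singles = sorted({(rl, dl) for rl, _, _ in tasks
                      for _, dl, _ in tasks if rl < dl})
    for size in range(1, len(singles) + 1):
        for combo in combinations(singles, size):
            yield list(combo)


def min_processors(tasks):
    return ceil(max(density(tasks, a) for a in k_intervals(tasks)))


# Illustrative task set (RL, DL, CT).
TASKS = [(0, 1, 1), (0, 3, Fraction(5, 2)), (1, 4, Fraction(5, 2)),
         (3, 4, 1), (0, 4, 1)]
```

For this task set the maximum density is reached on the whole interval [0,4], which gives two processors.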


3. Results for the general case 


Theorem 1:


Let an adjusted general monitor system, GMS_t, at
time t and an M-processor system be given. Then
the subsequent algorithm A1 computes the non-empty
set of all admissible assignments.


A1:

Input : GMS_t, M

Step 1: Compute the set, T^1_t, of all requested
        tasks, T_t, fulfilling the condition [...]

Step 2: For each critical k-interval, not beginning
        at t, compute the set of all requested
        tasks making it over-critical, unless
        assigned immediately. The union of all
        these sets is called T^2_t.

Step 3: For each critical k-interval, beginning
        at t, compute the set of all requested
        tasks contributing to its MINLOAD. The
        intersection of all these sets is called
        T^3_t.

Output: Set of admissible assignments, AA_t, defined by

        $AA_t := \{ (T^1_t \cup T^2_t \cup Y) \mid Y \subseteq T^3_t \wedge |Y| = M - |T^1_t \cup T^2_t| \}$, i.e.
        $AA_t = \{ X \mid |X| = M \wedge T^1_t \cup T^2_t \subseteq X \subseteq T^3_t \}$


Remarks


For an adjusted GMS the interval [t,L(GMS_t)] is
always critical at time t. Thus all requested
tasks may belong to admissible assignments. Moreover,
both definitions of AA coincide because [...]


Theorem 2:


Let an adjusted general monitor system, GMS_t, at
time t and an M-processor system be given and let
X be arbitrarily chosen from the set of admissible
assignments, AA_t, calculated by algorithm A1.
Then the respective maximal running time, t^X_max,
can be computed by the subsequent algorithm A2.


A2:

Input : GMS_t, X, M

Step 1: For each task, T_t, assigned by X,
        compute the time t' = DL(T_t) - CT(T_t).

Step 2: For each k-interval not beginning at t
        compute the next point in time, t' > t,
        at which the tasks not assigned by X
        would make it over-critical.

Step 3: For each critical k-interval beginning
        at t compute the next point in time
        t' > t at which a task assigned by X
        would no longer contribute to its
        MINLOAD.

Step 4: Compute the minimum, t^X_max, of all the
        above t' and of the computation times
        of the tasks assigned in X.

Output: Maximal running time t^X_max.

end Theorem 2


Remarks


For each k-interval the t' from steps 2 and 3 can
be computed by a simple and computationally efficient
algorithm, omitted here because of its notational
complexity. But note that the number of k-intervals
may be exponential in the number of tasks.


Theorem 3:


Let a general monitor system, GMS, and the number
of processors, M, be given such that


$M \ge \lceil \max \{ \mathrm{MINLOAD}(GMS,A) / \mathrm{length}(A) \mid A \text{ is a k-interval of the GMS} \} \rceil$


Let SA(GMS,M) denote the set of scheduling algo- 
rithms for the GMS and M processors obtained from 
the scheduling algorithm scheme in figure 2 by all 
deterministic interpretations of the starred lines, 
i.e. by all choice algorithms replacing these two 
lines. 


Let S(GMS,M) denote the set of all schedules for
the GMS on M processors obtained by applying all
SA ∈ SA(GMS,M) to the GMS.


Then S(GMS,M) is the set of all optimal schedules 
for the GMS on M processors. 


Let S ∈ S(GMS,M) be obtained by a scheduling algorithm
from SA(GMS,M) always choosing t^X_max for
all assignments. Then the number of assignments
in S is bounded by O(N^2).


Figure 2: scheduling algorithm scheme

Input: Adjusted GMS, M, t := 0, S := ∅

1. Compute the set of all admissible assignments, AA_t (by A1).
2. Choose any assignment X ∈ AA_t. (*)
3. Compute the maximal running time of X, t^X_max (by A2).
4. Choose any running time t^X, 0 < t^X ≤ t^X_max. (**)
5. Update the representation of the GMS and extend the
   schedule obtained so far: S := S | (X, t^X).
6. Are all tasks of the GMS completely executed?
   If NO, continue with step 1; if YES, stop.

Output: Schedule S

END


end Theorem 3


Remarks 


The optimal schedules of an adjusted GMS are characterized
by the remaining GMS_t being adjusted
at any point of time t. This also implies that

the admissible assignments are those assignments 
consisting of requested tasks which can be exe- 
cuted for an arbitrarily small time without in- 
creasing M in order to be able to execute the re- 
maining GMS completely. 


4. Scheduling algorithms for the case MINLOAD > 1 


[H] contains various special cases, charac- 
terized by additional assumptions about the given 
GMS, allowing us to find an efficient scheduling 
algorithm. One of them is briefly discussed in 
this paper subsequently. 


For a given GMS let {I_i, i=1,...,i_0} denote
the set of disjoint intervals in [0,L(GMS)] defined
by all requestlines and deadlines as boundaries
of the I_i; let I_i be located lower than I_j
if i < j. The GMS is called q-simple iff
MINLOAD(GMS,I_i) ≥ q · length(I_i) for all i=1,...,i_0.
(For the rest of the section we additionally assume
that q is the largest such number. This additional
assumption is made for simplicity of presentation
but without any deeper relevance and
can be omitted easily.)
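
The q-simple property can be checked directly from the definition. The sketch below cuts the time axis at all requestlines and deadlines, computes MINLOAD for each resulting interval I_i, and returns the largest integer q satisfying the condition. Treating q as an integer, the function names, and the sample task set are assumptions made here for illustration.

```python
from math import floor
from typing import List, Tuple

Task = Tuple[float, float, float]  # (RL, DL, CT)


def elementary_intervals(tasks: List[Task]) -> List[Tuple[float, float]]:
    """Disjoint intervals I_i bounded by consecutive requestlines/deadlines."""
    points = sorted({p for rl, dl, _ in tasks for p in (rl, dl)})
    return list(zip(points, points[1:]))


def minload_single(tasks: List[Task], a: float, b: float) -> float:
    """MINLOAD of the single interval [a, b]."""
    total = 0.0
    for rl, dl, ct in tasks:
        inside = max(0.0, min(b, dl) - max(a, rl))
        total += max(0.0, ct - ((dl - rl) - inside))
    return total


def largest_q(tasks: List[Task]) -> int:
    """Largest integer q with MINLOAD(GMS, I_i) >= q * length(I_i) for all i."""
    return min(floor(minload_single(tasks, a, b) / (b - a))
               for a, b in elementary_intervals(tasks))


# Illustrative task set (RL, DL, CT).
TASKS = [(0, 1, 1), (0, 3, 2.5), (1, 4, 2.5), (3, 4, 1), (0, 4, 1)]
```

For this task set every interval I_i has load density 1.5, so the set is 1-simple but not 2-simple.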


For increasing q the property of a GMS to be
q-simple obviously becomes more and more restrictive.
For q=1 figure 3 shows that this assumption
does not exclude many technically interesting
problems; especially for M=2 it is relatively
weak. In the sequel we describe an efficient scheduling
algorithm based on this assumption, which allows us
to reduce the M-processor problem to a single processor
problem (it is not difficult to see that
this assumption can be further weakened without
losing this reducibility).


In order to explain this reduction process
we start with an M-processor system and a GMS which is
adjusted and nowhere over-critical (with respect
to M) and (M-1)-simple. From the (M-1)-simplicity
we see that uniquely defined parts of
tasks of the GMS must be processed in uniquely defined
intervals I_i (defining a set POT_i) and that
the total length of these pieces of tasks in I_i
(i.e. of the pieces in POT_i) is equal to
(M-1) · length(I_i), i=1,...,i_0.


A short moment's reflection shows how to derive
a GMS^{M-1} and a GMS^1 from the given GMS:


- The GMS^{M-1} consists of the POT_i to be processed
in I_i, i=1,...,i_0,


- GMS^1 consists of those pieces of tasks in
GMS not contained in a POT_i, restricted by the original
requestlines and deadlines; if a piece of a
task is in a POT_i then for the remainder of this
task in the GMS^1 there is an exclusion interval,
EI_i ⊆ I_i, where this remainder may not be processed.


The GMS^{M-1} may be scheduled for M-1 processors
by a trivial scheduling algorithm [C, page 76].
The following observation is now important: We can
always determine a particular schedule for the
GMS^{M-1} on M-1 processors such that its EI_i's do not
exclude the schedule for the GMS^1 on the remaining
processor derivable by the DDEI-scheduling algorithm
(defined below). This follows from the property of the
open I_i's that they do not contain any requestline
or deadline.


In order to schedule the GMS^1 for the remaining
processor we use a modification of the deadline
driven scheduling algorithm cited in [S2],
such that the exclusion intervals of all tasks
are taken into account. We call this modification
DDEI-scheduling algorithm (for deadline driven
with exclusion intervals) and define it by means
of the tasks' modified deadlines. Given exclusion
intervals for a GMS^1, at any instant, t, the modified
deadline of T from GMS^1, MDL(T), is defined
to be DL(T) - SLEI(T) - t, where SLEI(T) denotes
the sum of the lengths of all exclusion intervals
for T in [t,DL(T)]. The DDEI-scheduling algorithm
now simply prescribes to schedule at any time a
task (from the set of all tasks being requested
and not yet completely scheduled and not entering
into an exclusion interval) with smallest modified
deadline (for all tasks from this set).
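
The selection rule of the DDEI-scheduling algorithm can be sketched as follows. The code computes SLEI(T) and the modified deadline MDL(T) = DL(T) - SLEI(T) - t and picks an eligible task with the smallest value. The class and function names are hypothetical, and "not entering into an exclusion interval" is interpreted here as "the instant t does not lie inside one of the task's exclusion intervals", which is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Task:
    name: str
    rl: float          # requestline
    dl: float          # deadline
    remaining: float   # length not yet scheduled
    exclusions: List[Tuple[float, float]] = field(default_factory=list)


def slei(task: Task, t: float) -> float:
    """Sum of the lengths of the task's exclusion intervals within [t, DL(T)]."""
    return sum(max(0.0, min(b, task.dl) - max(a, t)) for a, b in task.exclusions)


def modified_deadline(task: Task, t: float) -> float:
    """MDL(T) = DL(T) - SLEI(T) - t."""
    return task.dl - slei(task, t) - t


def ddei_pick(tasks: List[Task], t: float) -> Optional[Task]:
    """Return a requested, unfinished, non-excluded task with smallest MDL."""
    eligible = [
        x for x in tasks
        if x.rl <= t and x.remaining > 0
        and not any(a <= t < b for a, b in x.exclusions)
    ]
    return min(eligible, key=lambda x: modified_deadline(x, t), default=None)


# Illustrative tasks: A may not run during [1, 3).
A = Task("A", rl=0, dl=4, remaining=1, exclusions=[(1, 3)])
B = Task("B", rl=0, dl=3, remaining=1)
```

At t = 0 task A is chosen although its deadline is later than B's, because its exclusion interval shortens the time that is really available to it.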


5. Conclusion 


Obviously the GMS scheduling problem can be 
considered as a special graph scheduling problem; 
thus the former problem is simpler than the latter 
one. 


For the preemptive case and the graph schedu- 
ling problem presently finite scheduling algorithms 
are not known for M> 2 (the proof of polynomial 
completeness by Ullman refers to a somewhat differ- 
ent problem, [4]). 


For the preemptive case and the GMS scheduling
problem we presented finite scheduling algorithms
of complexity O(2^N), producing the class of all
schedules for arbitrary M (when maximal running
times are chosen for all assignments, their number
is bounded by O(N^2) in each schedule).


For arbitrary M we were not able to derive 
polynomial bounded algorithms nor were we able to 
show the polynomial completeness of the problem. 
In order to extend the knowledge about this prob- 
lem it thus seems reasonable to look for "sub- 
optimal" heuristic scheduling algorithms for a GMS 
or additional assumptions about GMS's, reducing the 
complexity of the problem. An example of what such
additional assumptions may look like and how weak
they may be is discussed in section 4. Both approaches
surely will be successful if it is possible
to reduce the number of k-intervals to be
considered in A1 and A2 to be bounded by a polynomial
in N. Such results obtained by additional
assumptions can be found in [H]. Presently we in- 
vestigate scheduling algorithms obtained by choos- 
ing feasible subsets (of polynomially bounded car- 
dinality) of the set of all k-intervals. 
Figure 3a: Representation of a GMS given by the
requestlines, deadlines and computation
times, respectively, of the tasks:


T_1 = (0,1,1); T_2 = (0,3,2.5);
T_3 = (1,4,2.5); T_4 = (3,4,1);
T_5 = (0,4,1).

The intervals to be checked are

I_1 = [0,1]; I_2 = [1,3]; I_3 = [3,4].


The parts of the tasks to be executed
in such an I_i on the first processor
are shown as bold lines. The thin lines
show the remainder for the GMS^1 to be
executed on the second processor.


The GMS is 1-simple.
Abstract -- The paper summarizes previous solutions
and presents new ones for the problem of constructing
time-optimal schedules for a set of tasks
to be executed on a two-processor system. We assume
that arbitrary precedence rules for the execution
of the tasks and the tasks' computation
times on any of the two processors are known in
advance.


1. Introduction 


Consider a finite, acyclic, weighted, directed
graph, G (fawd-graph), with N nodes and E edges.
G represents a task system; a node is a task, the 
weight of a node is the processing time of the 
task, and an edge (v,w) means that task v must be 
finished before task w can be started. 


The problem is to construct time-optimal (minimal-length)
schedules for a fawd-graph and for
a system of two identical processors (see [C],
page 84).


A well known special solution is obtained by
the algorithm of Muntz/Coffman [MC] which computes
a preemptive schedule in O(N^2) steps. If we restrict
attention to a system in which all tasks
require the same processing time the algorithm of
Coffman/Graham [CG] computes a nonpreemptive schedule
in O(N α(N) + E) steps where α(N) is an almost
constant function of N [Se].


It is obvious that due to the various con- 
straints imposed on real task systems (being not 
considered in this paper) a set of schedules is 
much more desirable in general than a single sche- 
dule because in many cases at least one of the 
schedules from that set will be compatible with 
such constraints. That means: as soon as different
and/or varying cost functions are to be considered
(i.e. reality is to be approximated) the approach
to the problem via classes of time-optimal schedules
[Sch1] seems to be adequate.


Predicates provide the best description for 
the class of all time-optimal schedules. Being true 
during processing of a fawd-graph on a multipro- 
cessor system these predicates guarantee the time 
optimality of the respective schedule. By this 
method for a certain type of task systems (characterized
by: arbitrary task lengths, precedence
structure being a forest or an antiforest, arbi- 
trary number of processors, preemptions are al- 
lowed) one of the authors [Sch1] constructed not 
only one special solution but the class of all 
time-optimal schedules, i.e. the general solution 
of this problem. Due to this general solution the 
authors [HSS] were able to derive a fast sched- 
uling algorithm (where the number of steps is lin- 
ear in N). 


The result of this paper is


- a scheduling algorithm scheme which describes
(by means of efficient algorithms) the
set of all time-optimal preemptive schedules for
a fawd-graph on a two-processor system (i.e. the
general solution of this problem) (a);

- a special time-optimal scheduling algorithm
of complexity O(N^2) (i.e. the same complexity
as the Muntz/Coffman algorithm) generating
schedules with at most 3N preemptions (whereas
Muntz/Coffman's schedule may have O(N^2) preemptions).


A short report about preliminary efforts to 
obtain these results is given in [SchS]. 


Basically, the scheduling scheme consists of 
two efficient algorithms to be applied repeatedly 
to G until it is completely scheduled: 

- The first algorithm, A1, computes the set
of all admissible first assignments, AA. That
means: A1 computes the set, AA, of all those assignments
the tasks of which can be executed by the
two processors for some time t, t > 0, without loss
of optimality of the whole schedule.

- For an arbitrary admissible first assignment
X the second algorithm, A2, computes its maximal
running time, t^X_max. That means: after selecting
arbitrarily any assignment X from AA, A2 computes
a time t^X_max such that the tasks of X can be
processed for time t, 0 < t ≤ t^X_max, without loss
of optimality of the whole schedule and t^X_max is the
largest such number.


The correctness of the algorithm is proved. 
A full version of this paper including all proofs 
will appear [St]. 


(a) 


Note that we do not talk about the set of all 
time-optimal scheduling algorithms but about the 
set of all time-optimal schedules. 


2. The class of all time-optimal schedules 


2.1 Notions and definitions 
Let G be a finite, acyclic, weighted, directed
graph (fawd-graph G), where weights belong to


the nodes. The length of a path of G is defined as 
the sum of the weights of the nodes on this path. 
The height H(G) of the graph is defined by the 
length of a longest path in G. |G| denotes the sum 
of all weights. 
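
The two quantities just defined are easy to compute for a small graph. The sketch below assumes the graph is given as a dictionary of node weights and a list of directed edges (a representation chosen here, not prescribed by the paper); `height_and_total` is a hypothetical helper name. It returns H(G), the length of a longest path measured in node weights, and |G|, the sum of all weights.

```python
from functools import lru_cache
from typing import Dict, List, Tuple


def height_and_total(
    weights: Dict[str, float], edges: List[Tuple[str, str]]
) -> Tuple[float, float]:
    """Return (H(G), |G|) for an acyclic graph with weighted nodes."""
    successors: Dict[str, List[str]] = {v: [] for v in weights}
    for v, w in edges:
        successors[v].append(w)

    @lru_cache(maxsize=None)
    def longest_from(v: str) -> float:
        # Length of a longest path starting at v, counting v's own weight.
        return weights[v] + max(
            (longest_from(w) for w in successors[v]), default=0
        )

    height = max(longest_from(v) for v in weights)
    return height, sum(weights.values())


# Illustrative graph: a precedes b and c, which both precede d.
WEIGHTS = {"a": 2, "b": 1, "c": 3, "d": 1}
EDGES = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
```

Here the longest path is a, c, d with length 6, while the sum of all weights is 7.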


Each node represents a task, T, the weight of
a node represents the length ℓ(T) of task T. The
directed edges between the nodes represent the
precedence relations ≺ between the tasks. The set
of all tasks is denoted by T. A fawd-graph is called
a task-graph.


The two processors are able to process tasks
with equal, constant, positive, finite speed, i.e.
the length of a task being processed is reduced by
one unit of length per one unit of time. If a task
is reduced to length zero, it is deleted from the
graph. A task may be executed if it has no predecessors.
A processor executing a task can be interrupted
before the length of the task has been reduced
to zero. This is called a preemption.
X ⊆ T (a) is called an assignment if all tasks
of X are processed simultaneously for some time,
t^X, the running time of X.


A schedule S for G is determined by a finite
sequence of assignments X and the respective running
times (b). This shortest sum of running times is
denoted by t_opt(G). In an optimal schedule all
assignments are called admissible. The longest running
time of an admissible assignment, X, is called
maximal running time, t^X_max.


In this paper the graph G is drawn in the
so-called stripe-representation D, using a cartesian
coordinate system. (See figures 1a - 1c.)
A task T is represented by a vertical bold line
of the length ℓ(T) which might be partitioned into
several parts connected by descending dashed
arcs (see figure 1c). If T precedes T' this is expressed
by a non-ascending dashed arc from the
bottom of the representation of T to the top of
the representation of T'. Arcs may be omitted in
certain cases (see figure 1b). Two simple representations
are D_ℓ and D_h, where all tasks are positioned
as low as possible and as high as possible,
respectively, in the interval [0,H(G)]. The
horizontal line through the top (bottom) of a
task of G in representation D is called start-line
(end-line) of this task. Any horizontal line
is called height-line.


The load density, ld, of G in D between
two height-lines is defined to be the sum of the
lengths of parts of tasks between these height-lines
divided by the distance between these
height-lines. If G has a representation such that
its ld = 2 everywhere in [0,H(G)], then G is called
adjustable and this representation is called
adjusted representation (or a-representation).


(a) underscores denote sets


Let us denote by L(G,D) the length of the
union of all intervals in [0,H(G)] in which
ld < 2. Let LMING := min {L(G,D) | D is a representation
of G in [0,H(G)]}. Then obviously

$t_{opt}(G) = \frac{1}{2}(|G| - LMING) + LMING = \frac{1}{2}(|G| + LMING)$

Finally we sometimes extend G (without
changing notation) by a so-called zero task, which
is unrelated to any task in G and which has
length LMING. Then t_opt(G) is not changed but G
is adjustable.
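
A short worked instance may make the formula for t_opt(G) concrete. The numbers are assumed for illustration only: a graph with total weight |G| = 7 whose best representation leaves a total length LMING = 5 on which only one processor can be kept busy.

```latex
\begin{align*}
t_{opt}(G) &= \tfrac{1}{2}\bigl(|G| - LMING\bigr) + LMING
            = \tfrac{1}{2}(7 - 5) + 5 = 6, \\
t_{opt}(G) &= \tfrac{1}{2}\bigl(|G| + LMING\bigr)
            = \tfrac{1}{2}(7 + 5) = 6 .
\end{align*}
```

The first form reads as "the part of the work that can be shared by both processors takes half its length, the rest takes its full length"; the second form is the same quantity after collecting terms.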


The subsequent solution is derived from
the principle - proved in [St] - that a schedule
for an adjustable task-graph is optimal iff at
any point in time the remaining graph is adjustable.


2.2 Algorithms 


The scheduling algorithm scheme we are
going to investigate is of the structure represented
graphically in figure 2. From this scheme a
scheduling algorithm is obtained by assigning an
interpretation to the starred lines. The scheme
consists mainly of four parts:

- checking whether G is adjustable

- determining the set of all admissible
first assignments, AA

- determining the chosen assignment's,
X ∈ AA, maximal running time, t^X_max

- updating G's representation such that
this step of execution is displayed.


The first three parts are based on the subsequent
algorithm A0 and slight modifications of
it. A0 starts from G in D_h and a partitioning of
D_h into p levels, defined by the end-lines of the
tasks of any chosen longest path. We try to
adjust G level by level from the bottom (level
number = 1) to the top (level number = p). As we
have on any level the task of the longest path
chosen, we only have to check whether we have within
the level to be adjusted additional tasks for
adjusting this level (we call this level a-level);
if there are not enough such tasks, the next
higher level(s) is (are) considered (we call this
level c-level) for a task to be moved down in order
to adjust the a-level. This changes G's representation.
Tasks of G in the current representation
which may be moved down to the a-level
without violating the precedence constraints are
called a-candidates. We finally denote by ald the
load density of the a-level in the current representation
of G.


(b). is called time-optimal or optimal if there 


is no other schedule with a shorter sum of 
running times. 


Figure 1a: Traditional representation of a task-graph.

Figure 1b: Stripe-representation (L-representation) where
start- and end-lines are only drawn for T_i. The others are
omitted in order to simplify the representation.
Note: H(G) does not depend on G's representation.
Theorem 1:


Let G be a task graph. Then the subsequent algorithm
A0 computes LMING.


A0:
begin
  input G in D^l; LMING := 0
  determine the p levels of any longest path
  for a-level from 1 step 1 until p
  do c-level := a-level + 1
     while ald < 2 ∧ c-level ≤ p
     do while ald < 2 ∧ ∃ a-candidate on c-level
        do move at most ((2 - ald) * length of a-level)
           units of length of any a-candidate from
           the c-level down to the a-level
        od
        c-level := c-level + 1
     od
     LMING := LMING + (2 - ald) * length of a-level
  od
  output LMING
end A0.


end Theorem 1

Remember that with LMING we know topt(G), too.
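As a small worked illustration of the relation between LMING and topt(G), the following Python sketch evaluates topt(G) = 1/2 (|G| + LMING). It assumes that |G| denotes the total length of all tasks of G; the function name and the use of exact fractions are choices made for this example only and are not part of the paper.

```python
from fractions import Fraction


def optimal_schedule_length(total_length, lming):
    """Return t_opt = (|G| + LMING) / 2 for a two-processor schedule.

    total_length: assumed to be |G|, the sum of all task lengths.
    lming: the minimal total length of intervals with load density < 2.
    """
    total_length = Fraction(total_length)
    lming = Fraction(lming)
    if not 0 <= lming <= total_length:
        raise ValueError("LMING must lie between 0 and the total length")
    # Both processors are busy for (|G| - LMING) / 2 time units,
    # and only one processor is busy for LMING time units.
    return (total_length - lming) / 2 + lming


# Two independent tasks of length 2: nothing is left unadjusted.
print(optimal_schedule_length(4, 0))
# A chain of two tasks with lengths 3 and 1: no overlap is possible.
print(optimal_schedule_length(4, 4))
```

The two calls cover the extreme cases: with LMING = 0 both processors are busy all the time, and with LMING = |G| the schedule degenerates to sequential execution.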


For the calculation of the set of all admissible
first assignments, AA, we use a modification
of the algorithm A0 denoted by A1.


Theorem 2: 


Let G be an adjustable task graph in D^e and let a
longest path (defining p levels) be chosen; let TP
denote its highest task. Then the set of all
admissible first assignments, AA, is obtained by
applying algorithm A1. A1 computes a special
a-representation D^a of G. Let p-set denote the set
of all tasks without predecessors in level p of G
in D^a. Let pld denote the load-density of level p.
Then AA is defined as follows:


AA := {(T,T') | T  = if pld = 2
                     then TP
                     else arbitrary from p-set,
                T' = arbitrary from p-set}


A1 is defined as follows.


A1:
begin
  input G in D^e
  determine the p levels of any longest path
  for a-level from 1 step 1 until p
  do while ald < 2
     do
       determine E and ES, E < ES, where E is the
       height of the lowest end-line of a-candidates
       and ES is the next higher such end- or start-line,
       and determine k, the number of a-candidates
       with end-line E. Move y units of length of
       each of these k a-candidates down to the
       a-level, where

         y = min {(2 - ald) * length of a-level / k, ES - E}
     od
  od
  output AA
end A1

end Theorem 2


For a task graph G the next theorem gives the
maximal running time t^X_max of an arbitrarily chosen
admissible assignment X ∈ AA.


Theorem 3: 


Let G be an adjustable task graph and let
X := (T,T') ∈ AA be an arbitrarily chosen admissible
first assignment. Let G^X denote the task graph
obtained from G by reducing the lengths of T and
T' by min {l(T), l(T')}. Let D^X denote the
representation D^l for G^X. Let the representation D' for G
be defined such that all tasks of G^X are located
as in D^X and T and T' are located on top of D^X.
By applying A2 to G in D' we obtain t^X_max.

A2 is defined as follows.


A2:
begin
  input G in D'
  determine the p levels of any longest path in G
  for a-level from 1 step 1 until p
  do
    see A1
  od
  t^X_max := minimum of the pieces of T and T' still
             beyond H(G^X)
end A2


end Theorem 3


The next theorem is the main result of the 
paper; it makes use of theorems 1-3. 


Theorem 4: 


Let SA(G) denote the set of scheduling algorithms 
for a task graph G obtained from the scheduling 
algorithm scheme, shown in figure 2, by all deter- 
ministic interpretations of the starred lines. Let 
S(G) denote the set of all schedules for G obtain- 
ed by applying all SA € SA(G) to G. Then S(G) is 
the set of all optimal schedules for G. 


end Theorem 4 


Unfortunately the approach taken to derive
this general solution is of no help if more than
two processors are to be scheduled. But as presently
two-processor systems are of great technical
importance and no general solution for m-processor
systems, m > 2, can be expected, a separate
investigation of this special case is surely justified.


2.3 The computational complexity of the algorithms 


Let G have N tasks and E arcs. Then the input
procedure has the complexity O(N+E). See figure 3
for an overview.


For computing LMING, the set of all admissible
first assignments and the maximal running time for
a chosen admissible assignment the algorithm A0
and its modifications, respectively, are used.
All three algorithms have the complexity O(N^2).
Because of their similarity it suffices to consider
A0 in order to obtain this bound.


Figure 2: Scheduling algorithm scheme

    Input: G, t := 0, S := ∅
    Compute LMING and extend G by a zero task of length LMING
    repeat
        Compute the set of all admissible first assignments, AA(G_t)
        Choose any X ∈ AA(G_t)                                   (*)
        Compute the maximal running time t^X_max of X
        Choose any time t^X, 0 < t^X ≤ t^X_max                   (*)
        Update the representation of G_t after the lengths of the
          tasks assigned in X have been reduced by t^X
        Extend the schedule obtained so far, i.e.
          S := S concat (X, t^X); t := t + t^X
    until all tasks of G are completely scheduled
    Output: S(G), topt(G) = t
    END

G_t here denotes the part of G not yet scheduled at time t.
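The control flow of the scheduling algorithm scheme of figure 2 can be sketched as a small driver loop in Python. This is only a skeleton under stated assumptions: the four callables stand in for A1, the two starred choices, and A2, and the toy stand-ins below treat all tasks as independent and do not implement the paper's algorithms (nor do they guarantee optimality). All function and parameter names are hypothetical.

```python
def run_scheme(lengths, admissible, choose_assignment, max_running_time, choose_time):
    """Driver loop following figure 2; the callables are supplied by the caller."""
    remaining = dict(lengths)          # remaining length of every task of G_t
    schedule, t = [], 0                # S := empty, t := 0
    while any(r > 0 for r in remaining.values()):
        aa = admissible(remaining)                 # AA(G_t), by A1 in the paper
        x = choose_assignment(aa)                  # starred line: any X in AA(G_t)
        t_max = max_running_time(remaining, x)     # t^X_max, by A2 in the paper
        dt = choose_time(t_max)                    # starred line: 0 < t^X <= t^X_max
        if not 0 < dt <= t_max:
            raise ValueError("running time must satisfy 0 < dt <= t_max")
        for task in x:                             # update the representation of G_t
            remaining[task] -= dt
        schedule.append((x, dt))                   # S := S concat (X, t^X)
        t += dt                                    # t := t + t^X
    return schedule, t


# Toy stand-ins (assumptions, NOT A1/A2): independent tasks, longest first.
def toy_admissible(remaining):
    live = sorted((k for k, r in remaining.items() if r > 0),
                  key=lambda k: (-remaining[k], k))
    return [tuple(live[:2])]


def toy_max_running_time(remaining, x):
    return min(remaining[task] for task in x)


schedule, total = run_scheme({"A": 3, "B": 1, "C": 2}, toy_admissible,
                             lambda aa: aa[0], toy_max_running_time,
                             lambda t_max: t_max)
print(schedule, total)
```

Passing the choices in as callables mirrors the idea that each deterministic interpretation of the starred lines yields one scheduling algorithm.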


Applying A0 basically requires two steps:

1. Establishing the appropriate initial
representation for execution of A0, i.e. bringing
G into D^l. This requires O(N+E) steps.

2. Execution of A0, requiring O(N^2) steps.
This low bound seems to be achievable, observing
that each of the N tasks can be cut in N parts at
most; it actually is achievable by an appropriate
implementation as shown in [St].


The updatings of G_t, S and t obviously can
be done in O(N) steps.


If always the largest running time, t^X_max, is
chosen, then the number of assignments is bounded
by O(N^2) because no assignment can occur twice.
Thus the complexity of the output procedure is
O(N^2), too.


If moreover the choice of an X ∈ AA(G_t) is
always done in at most O(N^2) steps then the
complexity of the scheduling algorithm (obtained by
this interpretation of the scheduling algorithm
scheme) is bounded by O(N^4).


    Step                                           Complexity

    BEGIN
    Input G                                        O(N+E)
    Compute LMING (by A0) and                      O(N^2)
      add zero task, if required
    Loop until all tasks of G are                  O(N^4) in total
      completely executed:
        Compute AA(G_t) (by A1)                    O(N^2)
        Choose X ∈ AA(G_t) (by caX)                O(N^2)
        Compute t^X_max (by A2)                    O(N^2)
        Various updatings                          O(N)
    Output schedule                                O(N^2)


Figure 3: Computational complexity of any scheduling
          algorithm derived from figure 2
          by interpreting the starred lines as
          follows: t^X := t^X_max, and for determining
          an X ∈ AA(G_t) we have some choice
          algorithm, caX, of complexity O(N^2), at
          most.


3. An efficient scheduling algorithm


In this section we give a scheduling algorithm,
LNP, generating schedules with a low number of
preemptions. LNP has the computational complexity
O(N^2), i.e. in the general case it is not faster
than the algorithm with O(N^2) steps given by
Muntz/Coffman(a). But the total number of preemptions
of schedules generated by LNP is bounded by
3N, whereas the M/C algorithm may generate O(N^2)
preemptions, as can easily be seen [Sch2].


(a) For the special case l(T) = 1 for all T ∈ T,
i.e. all tasks have the same length, LNP is of
complexity O(N+E), if we have a computer with
multi-level indirect addressing [T, p.84], like
e.g. the PDP-10, or an instruction determining
the number of leading zeros in a word [H, p.245],
or something similar.


It is quite interesting to see that the LNP
scheduling algorithm is not derived from the
scheduling algorithm scheme, although the
schedules generated by the LNP algorithm may be
generated by the scheduling algorithm obtained by a
suitable interpretation of the scheduling scheme.
Nevertheless it probably would not have been pos- 
sible to construct the LNP algorithm and it is 
hard to see how to prove its correctness without 
the analysis required for the scheduling algorithm 
scheme. 


The algorithm LNP starts from an adjustable
G in D^e. It begins with the highest tasks of G
in D^e and proceeds to the lowest tasks. At any
time t during scheduling G we denote by G_t the
part of G not yet scheduled. The notations
introduced subsequently all refer to G_t in D^e,
unless stated otherwise. Let HEL denote the
highest end-line and let HEL-tasks denote the
set of all tasks with start-lines higher than HEL.
Let T^L be a task with the lowest end-line of all
tasks in HEL-tasks and let T^H be a task with
start-line H(G_t). HEL-tasks' is obtained from
HEL-tasks by removing T^L and T^H from it. Let T^LL
be a task with a lowest end-line of all tasks in
HEL-tasks', if it is not empty. HEL-tasks" is
obtained from HEL-tasks' by removing T^LL from it.


Let the initial current representation, D^c,
of G_t be defined such that

- all tasks not in HEL-tasks' are located
  as in D^e

- the end-lines of the tasks of HEL-tasks"
  have the height HEL

- a piece of T^LL of length z is located
  beyond HEL, the remaining piece of T^LL
  being located as in D^e, where

    z := min { l(T^LL), sum of the lengths of the pieces
               beyond HEL of all tasks of
               HEL-tasks \ {T^LL} of G_t in D^c }.

Let the current inadjustment of G_t in D^c be
defined as


    CIA := 2 * (H(G_t) - HEL) - sum of the lengths of the
           pieces beyond HEL of all tasks of G_t in D^c.
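The definition of the current inadjustment is plain arithmetic and can be checked with a short Python sketch. The function names and the example numbers are assumptions for illustration; `piece_to_move` reflects the rule used later in LNP, where a piece of length min {l(T^C), CIA} is located beyond HEL.

```python
def current_inadjustment(height, hel, pieces_beyond_hel):
    """CIA = 2 * (H(G_t) - HEL) - total length of the pieces beyond HEL."""
    return 2 * (height - hel) - sum(pieces_beyond_hel)


def piece_to_move(task_length, cia):
    """Length of the piece of the next task that is located beyond HEL."""
    return min(task_length, cia)


# Hypothetical numbers: H(G_t) = 10, HEL = 7, two pieces of lengths 3 and 2.
cia = current_inadjustment(10, 7, [3, 2])
print(cia)                      # capacity 2 * 3 = 6, occupied 5
print(piece_to_move(4, cia))    # only one unit of a task of length 4 is needed
```

With two processors the stripe between HEL and H(G_t) can hold 2 * (H(G_t) - HEL) units of work, so CIA measures how much work is still missing before that stripe is fully loaded.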


As long as CIA > 0 we must change D^c once more by
moving another task (or a piece of it) up into a
position beyond HEL. For this purpose we take a
task with the highest end-line of all tasks of G_t
in D^c starting not above HEL, and which may be
moved up beyond HEL without violating the
precedence rules in G (this may imply moving up a
task, which is located beyond HEL, until its
start-line becomes H(G_t)). Let this latter task
(required for reducing the current inadjustment
of the part of G in D^c beyond HEL) be denoted by
T^C.


As soon as CIA becomes zero, the pieces of
the tasks beyond HEL may be scheduled by a simple
algorithm, e.g. the "packing" algorithm from
[C], page 76.


Figure 4: An example of the application of the algorithm LNP.
Note the following peculiarities of this example:

- the highest and the lowest level cannot be adjusted

- the second lowest level has a current inadjustment
  which can be reduced to zero

- in the second highest level the task with the second
  lowest end-line (T15 = T^LL) is taken only partially;
  moreover, T^L = T^H.


Note that obviously the computation of HEL and
the associated adjusting takes place at most N
times and that each time at most 3 preemptions are
required.


Theorem 5: 


Let G be a task graph in D^e. Then the application
of the algorithm LNP (defined below) generates an
optimal schedule, S, for G in at most O(N^2) steps
and S contains at most 3N preemptions.

The algorithm LNP is defined as follows.


LNP:
begin
  input G in D^e
  while G not completely scheduled
  do for G_t determine
       HEL, HEL-tasks", T^H, T^L, T^LL
     move the tasks from HEL-tasks" and a piece of
       T^LL up beyond HEL such that G_t is brought into
       its initial D^c
     while CIA > 0 ∧ ∃ T^C
     do determine T^C
        locate a piece of T^C of length min {l(T^C), CIA}
          beyond HEL
     od
     schedule the pieces of G_t in D^c beyond HEL
       by applying the packing algorithm
  od
  output S
end LNP


end Theorem 5


Because of its similarity to the algorithm A0
the algorithm LNP terminates after O(N^2) steps.
Figure 4 gives an example of the application of the
algorithm LNP to a G such that the various cases
to be considered do occur.
ANALYSIS OF STRUCTURES FOR PACKET COMMUNICATION*


Robert G. Jacobsen 
David P. Misunas 
Laboratory for Computer Science 
Massachusetts Institute of Technology 
Cambridge, Massachusetts 02139 


Abstract -- In a system utilizing packet 
communication techniques of message transmission, 
all communication between the units comprising the 
system is through discrete blocks of information 
conveyed in packets. Interconnection structures 
in such systems can range from bus and crossbar 
structures to complex routing networks. A 
comparative analysis of a number of 
interconnection structures for packet 
communication systems is presented and tradeoffs 
between the various structures in terms of cost 
and performance are analytically examined. 


Introduction 


The increasing popularity of multiprocessor 
systems and the corresponding necessity for 
efficient interprocessor communication means has 
spurred the study and development of communication 
paths for use in such systems. One means for 
interprocessor communication which is gaining 
popularity is that of packet communication. In a 
system with packet communication architecture, the 
units comprising the system communicate through 
the transmission of discrete information packets 
[2]. 


Classical approaches to the design of 
communication paths have included such structures 
as busses and crossbar switching networks. These 
structures are necessarily small, due to the small 
number of interconnected units and due to the 
speed requirements placed on the structure. As 
the number of interconnected units increases, 
these structures become cumbersome both in size 
and processing capability. 


More recently, a new interconnection 
structure, the routing network, has been presented 
and used in the design of a new type of parallel 
computer [3]. This structure is capable of 
simultaneously conveying many packets to their 
destinations in the processor and has a slower 
growth rate than the crossbar structure. 


*This research was supported by the National
Science Foundation under grant DCR75-04060 and by
the Advanced Research Projects Agency of the
Department of Defense, monitored by the Office of
Naval Research under contract number N00014-75-C-
06661.
The tradeoffs between the various
interconnection structures are not clearly
understood. In the case of the routing network,
little analysis has been performed at all.
Detailed studies have examined such structures as
the bus and crossbar [5, 9]. Some network
structures have been studied [1, 8], particularly
in the context of telephone switching networks [6,
7, 8]. However, these studies have generally
considered only fixed connection circuits, rather
than packet switching circuitry.


In the analysis of the present paper, we 
examine the characteristics of three communication 
structures: the bus, the crossbar, and the 
routing network. The cost and performance of each 
structure is analyzed to yield results as to the 
various tradeoffs involved in the choice of one 
structure over another. The analysis of these 
interconnection structures is supported through 
simulation results obtained on a packet 
communication simulation facility. 


System Architecture 


The design of a system interconnection 
structure is a difficult and poorly-understood 
problem, generally relying heavily on the 
experience of the system architect. There are no 
rules or guidelines for one to follow in such an 
exercise, merely a few general philosophies. In 
the following paragraphs, we will examine this
situation more closely in the context of a packet
communication system.


A packet communication system generally has 
some structure similar to that shown in Figure 1. 
The units comprising the User of Figure 1 may be 
processors, memories, functional units, or any 
other devices capable of message transmission or 
reception. The Communication Network of the 
system provides a path between the various units 
of the User. This interconnection structure may 
provide a path from every unit to every other 
unit, from groups of units to groups of units, or 
from each unit to one or several of the others. 
For the purposes of this discussion, we will 
assume the most general case; that is, every unit 
of the User can communicate with every other unit 
through the Communication Network. Other 
interconnection schemes can be considered as being 
composed of a number of embodiments of this more 
general case. 


Figure 1. System Structure (the units of the User
connected through the Communication Network)


Presumably, the designer of a packet 
communication system has an application area in 
mind for the system and has some idea of the 
amount of traffic which will pass over the 
communication medium. Thus, through some
analysis, one should be able to generate a curve 
corresponding to the solid line of Figure 2. Such 
a user load curve expresses the number of packets 
generated as a function of the time required for 
an individual packet to transit the communication 
network and should always have a non-positive 
derivative, indicating that interunit 
communication will generally occur less frequently 
as the communication times increase. 


On the other hand, the dashed curve of Figure
2 represents the load characteristics of the
Communication Network and always has a non-negative
derivative. The slope of the
Communication Network load curve demonstrates that,
as the load on the communication medium increases,
the delay through the medium should eventually
increase.


Figure 2. System Operating Characteristics (packet
traffic plotted against network transit time; the two
curves intersect at the operating point)
Generally, the two curves intersect at a
point which will be the operating point of the 
system. Clearly, the system is only stable at the 
operating point and any digression from that point 
is countered by forces which tend to return the 
packet flow to the operating point. 


Were it possible to empirically derive the 
User and Communication Network curves of Figure 2, 
the analysis and synthesis of packet communication 
systems would be greatly simplified. If there 
existed curves for the various types of 
interconnection structures, a designer need only 
develop the characteristic curve of his proposed 
User structure, choose a desired operating point 
on that curve, and match the appropriate 
Communication Network curve to yield the best 
cost/performance at that operating point. 


Such a scheme may seem impractical; however,
methods similar to this have been derived for many
other branches of engineering, and there is no
explicit reason why it is not possible to do so
for aspects of computer design.


The remainder of this paper describes some 
preliminary results which were achieved while 
trying to generate load curves for various 
Communication Network structures. Whereas the 
achieved results do not yield rules for processor 
design, they provide a first step in that 
direction through the analysis of packet flow in 
the structures.


Network Representation 


The communication networks of the present 
study are formed of arbitration units and switch 
units. Each arbitration unit accepts the first 
packet to arrive at any input and passes the 
accepted packet to its output. In the case of 
conflict, one packet is arbitrarily selected and 
passed to the output before the other(s). Each 
switch unit transfers a packet on its input to one
of its outputs, generally controlled by some
switching specification contained in the packet. 


The bus model of Figure 3 comprises an
arbitration unit followed by a switch unit. 
Similarly, models for a crossbar and a routing 
network are shown in Figures 4 and 5. A network 
such as that of Figure 4 which is composed 
initially of switch units followed by arbitration 
units is called a distribution network, and a 
crossbar is one configuration of such a network. 
Similarly, a network which contains an initial 
stage of arbitration as that of Figure 5 is called 
an arbitration network. 


The networks under study are structured as a 
number of stages connected in sequence. Each 
stage of a network is composed exclusively of 
either arbitration or switch units and is 
characterized by the log to the base N of the
fanout/fanin ratio:


    log_N ( (Number of Outputs) / (Number of Inputs) )


Figure 3. Structure of a Bus


This means of characterization has been chosen for 
two reasons. First, the size of the individual 
arbitration and switch units comprising each stage 
is clearly specified. Second, such a 
characterization represents a constant network 
architecture, regardless of the number of inputs 
and outputs. 


The bus structure of Figure 3 (and all bus 
structures) is characterized by (-1, 1). 
Similarly, all crossbar structures are 
characterized by (1, -1). The "square-root" 
arbitration network of Figure 5 has the 
characterization (-1/2, 1/2, -1/2, 1/2). 


Note that for an NxN communication network, 
the sum of all numbers in the network 
characterization must be equal to 0. Furthermore, 
in order for every input of a network to be able 
to communicate with every output, the sum of the 
absolute values of the numbers comprising the 
network characterization must be at least two. If 
the sum is greater than two, the network contains 
redundant paths. 
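The rules for a network characterization stated above can be checked mechanically. The following Python sketch is an illustration only; the function names and the dictionary keys are hypothetical, and exact fractions are used so that the sums compare cleanly.

```python
import math
from fractions import Fraction


def stage_exponent(num_inputs, num_outputs, n):
    """log to the base N of the fanout/fanin ratio of one unit in a stage."""
    return math.log(num_outputs / num_inputs, n)


def check_characterization(stages):
    """Apply the stated rules to a characterization such as (-1, 1)."""
    values = [Fraction(s) for s in stages]
    total = sum(values)
    span = sum(abs(v) for v in values)
    return {
        "is_nxn": total == 0,             # the numbers must sum to 0 for an NxN network
        "fully_connected": span >= 2,     # every input can reach every output
        "redundant_paths": span > 2,      # more than the minimum connectivity
    }


half = Fraction(1, 2)
print(check_characterization([-1, 1]))                     # bus
print(check_characterization([1, -1]))                     # crossbar
print(check_characterization([-half, half, -half, half]))  # "square-root" network
print(stage_exponent(16, 1, 16))                           # N-input arbitration unit
```

For an N-input arbitration unit with a single output the exponent is -1, which matches the first entry of the bus characterization (-1, 1).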


Figure 4. Structure of a Crossbar 
Figure 5. Structure of a Routing Network


At this point, we shall further restrict the
networks under analysis to constant geometry NxN
communication networks which can be characterized
by a positive integer fraction f, where the
network characterization is (-f, f, -f, f, ...)
for an arbitration network or (f, -f, f, -f, ...)
for a distribution network. The number of
occurrences of f in each characterization is equal
to the number of stages in the network, that is,
to 2/f. Bus structures, crossbar structures, and
simple power networks are examples of networks
with such a characterization.


This restriction does not necessarily 
preclude the consideration in our model of 
networks which do not have alternating stages of 
arbitration and switch units. Without loss of 
generality, adjacent stages of the same type can 
be considered as one stage with a characterization 
which is equal to the sum of the characterizations 
of the two stages. However, the model described 
herein is only applicable to networks which can be 
characterized by a constant fraction f once 
reduction of identical adjacent stages has been 
performed. 


Performance Analysis 


For the purposes of finding the 
characteristic curve of a communication network,
we need to make two simplifying assumptions. 
First, we consider the cost of a device 
proportional to the speed of the device times the 
number of wires connected to it. This assumption 
is not precisely accurate, but close enough for 
the purposes of this discussion. 


Second, we assume that the packet
distribution on the inputs of a communication 
network is even and Poisson and the distribution 
through any cross section of the network is even. 


The communication networks under study are 
composed of an interconnection of one basic unit 
type, called a tie and consisting of an 
arbitration unit and a switch unit. The bus of 
Figure 3 is composed of one such tie. The network 
of Figure 5 can readily be seen to comprise a 
number of ties. Although the topology of a 
distribution network is slightly different than 
that of the networks in Figures 3 and 5, such a 
structure can be analyzed in a similar fashion. 


We wish to examine two variables within each 
communication structure, a delay derater D and a 
loading representation F. D represents the 
average transit time for the network divided by 
the minimum transit time and can assume values 
ranging from one to infinity. D=1 signifies that 
the transit time through the communication network 
is only the hardware delay, whereas larger values 
of D indicate the presence of conflict in the 
structure. 


F represents the fraction of the network that 
is not in use, that is, the free capacity of the
network divided by the total capacity. In the 
following study, we examine D as a function of F 
to achieve each network characterization. The 
communication network load curve of Figure 2 
represents a graphical depiction of a function 
similar to (1-F) vs. D. We have made this
modification to the axis of the graph for the 
purposes of simplifying the analysis and the 
involved mathematics. 


Representing the interarrival time on each
input of an n-input tie by I and the service time
by T, we find that a packet will arrive every I/n
and hence:


    F_tie ≈ 1 - nT/I


Generalizing to all the units of a stage, a packet
can be transmitted to the next stage at most every
T(n/N) = T(N^f/N) = T/N^(1-f). Thus:


    F_stage ≈ 1 - (T/N^(1-f)) / (I/N)
            = 1 - N^f T/I


Since all stages in this type of network are
similarly constructed:


    F_network = F_stage ≈ 1 - N^f T/I


The application of queueing theory techniques
to the performance analysis of one tie,
considering each tie as a queue and assuming
Poisson arrival rates, yields the result:


    D = 1 + (1-F)/4F


All ties in the network operate at the same
F. Hence, overall, we can say:


    D_network ≈ 1 + (1-F_network)/4F_network
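The two relations of this section can be evaluated numerically with a short Python sketch. It assumes that 4F in the formula is the whole denominator, i.e. D = 1 + (1-F)/(4F), and that n = N^f; the function names and the sample values are chosen for illustration only.

```python
def free_capacity(n_units, f, service_time, interarrival_time):
    """F = 1 - N^f * T / I, the fraction of the network that is not in use."""
    return 1 - (n_units ** f) * service_time / interarrival_time


def delay_derater(free_fraction):
    """D = 1 + (1 - F) / (4 F), the average over minimum transit time."""
    if not 0 < free_fraction <= 1:
        raise ValueError("F must satisfy 0 < F <= 1")
    return 1 + (1 - free_fraction) / (4 * free_fraction)


# Hypothetical network: N = 16, f = 1/2, service time 1, interarrival time 8.
F = free_capacity(16, 0.5, 1, 8)
print(F, delay_derater(F))
# An unloaded network only shows the hardware delay.
print(delay_derater(1.0))
```

As F approaches zero the derater grows without bound, in line with the statement that D ranges from one to infinity.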


Simulation Results 


Utilizing a packet communication simulation 
facility, a number of bus, crossbar, and routing 
network structures were simulated to see if actual 
performance followed the D = 1 + (1-F)/4F formula. 
The simulation results are depicted in Figure 6. 


The solid line of Figure 6 represents the 
graph of D = 1 + (1-F)/4F, and the points 
resulting from the simulation appear to observe 
this characteristic for the three structures under 
study. 


The simulation modelled each network input as
an independent source with a Poisson distribution
and given interarrival time. The discrepancies of
the simulation from the model for small values of
F are due to the fact that the model contained
infinite queues between the sources and the input
ports, whereas such is impractical in the
simulation, eventually causing the input queues to
back up and affect operation of the sources.


Figure 6. Simulation Results


Network Selection 


The cost analysis for an arbitration network
such as that of Figure 5 can be represented as
follows, where C_AN is the cost of the network:


    C_AN = (number of stages) (cost of each stage)
         = (1/f) (speed * number of wires)
         = (1/f) ((N^f / f) * N)
         = N^(1+f) / f^2


In this case, speed is equal to N^f / f to maintain a
constant average delay through the network with
changes in f. The term N^f compensates for the
increased loading of arbitration units due to the
compression by N^f. The 1/f arises from the need
for each stage to operate faster in networks with
more stages.


In the case of a distribution network:


    C_DN = (number of stages) (cost of each stage)
         = (1/f) (speed * number of wires)
         = (1/f) ((1/f) * N^(1+f))
         = N^(1+f) / f^2


A distribution network has a greater number of
wires because each input wire of a stage of such a
network is expanded to N^(1+f) wires. Due to this
expansion, the component speed in a distribution
network is only affected by the number of stages,
that is, by 1/f.


Thus the linear cost assumption has led us to
the conclusion that for some fixed performance,
the arbitration network of Figure 5 costs the same
as the distribution network of Figure 7. This
result is non-intuitive at first; however,
consider an arbitration network of complexity N.
The units comprising this network have speed N due
to the initial compression factor. The complexity
of an equivalent distribution network is N^2, but
the additional parallelism allows the network to
be constructed of components with speed 1. Hence,
the cost of the two networks is equivalent.


The minimum of the network cost N^(1+f) / f^2
occurs at


    1/f = (1/2) ln N


where 1/f is the number of stages. Hence, for the
linear cost assumption of the model, the following
structures are best suited for the specified
number of inputs for either arbitration or
distribution network:


       N    Structure

       7    1-stage networks (bus and crossbar)
      50    2-stage networks
     400    3-stage networks
    3000    4-stage networks

An interesting result which arises from the
performance computations is the determination of
the optimal value of n, that is, the number of
inputs to each arbitration unit and outputs of
each switch unit. As we have seen, the minimum
cost occurs when f = 2/ln N. Thus, these
expansion and compression ratios should be:


    N^f = N^(2/ln N) = e^2 ≈ 7
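The cost minimum follows from setting the derivative of N^(1+f) / f^2 with respect to f to zero, which gives ln N = 2/f. The Python sketch below evaluates this numerically as a cross-check of the table; the function names are hypothetical, and rounding the stage count to the nearest integer is an assumption made for this illustration.

```python
import math


def network_cost(n, f):
    """Cost N^(1+f) / f^2 under the linear cost assumption of the model."""
    return n ** (1 + f) / f ** 2


def optimal_stage_count(n):
    """1/f = (1/2) ln N, the stage count that minimizes the cost."""
    return 0.5 * math.log(n)


def optimal_unit_size(n):
    """N^f at f = 2 / ln N, i.e. inputs per arbitration unit."""
    return n ** (2 / math.log(n))


for n in (7, 50, 400, 3000):
    print(n, round(optimal_stage_count(n)), optimal_unit_size(n))

# The cost at the optimum is lower than at nearby values of f.
f_best = 2 / math.log(400)
print(network_cost(400, f_best) < network_cost(400, 0.8 * f_best))
print(network_cost(400, f_best) < network_cost(400, 1.2 * f_best))
```

The unit size is the same for every N because N^(2/ln N) = e^2, which is the value of about 7 quoted in the text.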


To utilize the previously described results 
in the design of a packet communication system,
one first determines the load curve of the units 
to be interconnected. The architecture of the 
communication network utilized in the system is 
specified by the number of units. With these 
specifications in mind, there are a number of 
design choices which can be made. 


The load curves of the communication network 
consist of a family of curves which are parametric 
with cost. To design for a specific cost or 
technology, the intersection of that member of the 
family with the user load curve yields the 
performance which can be achieved. 


Conversely, to structure the system for a 
specific performance, the desired operating point 
on the user curve is specified and the network 
curve which passes through that point determines 
the cost and speed necessary in the component 
parts. 
Figure 7. Structure of a Distribution Network


The choice of either an arbitration network 
or a distribution network must take into account 
important factors such as the available 
technologies. While these factors are not 
included in the model, they will dictate actual 
use of any results achieved therefrom.


Concluding Remarks 


This attempt to probe the interconnection 
problem for packet communication systems has left 
many questions unanswered. The model utilized has 
a number of deficiencies and remains to be made 
more exact and extended to structures other than 
certain NxN power networks, such as asymmetric 
networks and concentration networks. Further 
refinement of the model and addition of other
structures should provide much information useful 
in the synthesis of processor structures for 
packet communication. Despite its deficiencies, 
the model provides a first attempt to analyze such 
packet communication interconnection structures 
and yields some interesting insights into their 
behavior. 
Abstract -- In the paper, the concept of data
structure architectures is developed as a solution
to the problem of providing increased hardware
support for the basic task of computing, viz. the
creation and processing of data structures. As a
starting point, a uniform algebraic description of 
data structures is presented. Consequently, the 
necessity for a management of the two fundamental 
types of data entities, ordered sets and general 
sets, is recognized. In order to allow a machine 
to handle the various data structures by a stan- 
dardized hardware, an intermediate data structure, 
called the basis and managed by hardware, is 
introduced. The programmer creates arbitrary data 
structures in terms of basis elements which are, 
in turn, mapped by the hardware onto consecutive 
storage. The processing of basis elements in res- 
ponse to a single machine instruction is based on 
the referencing of basis element descriptors and 
implemented by pipelined processors.


Keywords: computer architecture, performance 
architecture, general-purpose computing, data 
structures, ordered sets, data model, descriptor- 
referenced allocation, tagged architectures, hard- 
ware execution. 


1. Introduction 


The basic organizational concept of most 
computers presently being used or being marketed 
is still the 30-year-old concept as developed by 
von Neumann, Burks, and Goldstine [1]. In our 
opinion, the reason for the amazing longevity 
of the von Neumann principle is its unique com- 
bination of simplicity and flexibility. The von 
Neumann concept may be epitomized as a concept of 
minimal hardware resources: The basic von Neumann 
machine encompasses one central processing unit, 
one main memory, and one input/output channel. 


This concept of hardware minimality, which 
was perfectly adequate at a time when the hard- 
ware of a computer was the major cost factor, has 
meanwhile turned into the major factor that will 
obsolete the von Neumann architecture. In the 
age of dramatically decreasing cost of standard- 
ized LSI componentry, concepts are needed which 
allow increased hardware expenditures in order to 
achieve certain design objectives such as an in- 
crease in performance or availability or both. 
Such a multiplication of hardware resources 
implies the abolishment of the most severe perfor- 
mance-limiting feature of the von Neumann machine, 
namely that it manipulates the content of only a 
single memory location at a time, in favor of 
the simultaneous accessing and processing of a 
set of values, i.e., in favor of parallel processing. 


Most of the existing parallel processing 
architectures, however, were initially designed 
for special purposes rather than for general- 
purpose computing, and many of these architectures 
do not lend themselves very well to a generaliza- 
tion. Whereas it is a rather straightforward 
task to design the architecture of a special pur- 
pose computer, based on a homogeneous class of 
algorithms (e.g., for solving partial differential 
equations or for processing a matrix of radar 
data), it is not possible to define such distin- 
guished classes of algorithms after which a com- 
puter architecture could be modelled if the uni- 
verse of all possible algorithms is considered. 
However, in the search for a class of architec- 
tures for general-purpose computing, i.e., arch- 
itectures which can really replace the von Neumann 
architecture, the whole domain of computing must 
be taken into account. 


The most general definition of computation is 
that of "a sequence of transformations which 
transform an initial representation through a 
sequence of intermediate representations into a 
final representation" [2]. A representation is a 
transforming function and its data. As it is not 
possible to identify patterns in the universe of 
all possible transforming functions which could 
render the blueprint for a class of general-pur- 
pose architectures, the only possibility left is 
the structuring of the data or, more precisely, 
the processing of appropriately structured data 
entities. 


In the von Neumann machine, data are totally 
unstructured, i.e., the only data entity of the 
machine is the scalar. In real-world computation, 
we always find structuring relationships between 
the data of a program which constitute the basis 
for data retrieval and processing. An architec- 
ture which supports the representation and pro- 
cessing of arbitrary data structures by hardware 
shall be called a data structure architecture 
(DSA). It need hardly be emphasized that a data 
structure architecture should be complete and 
minimal, i.e., it should allow for the representation 
of any desired structure, and it should 
employ for this purpose a minimal number of stan- 
dardized tools. 


2. A Formal Definition of Data Structures 


Knuth [3] defines data structure as "a table 
of data including structural relationships". 
Formalizing this, we define a data structure as a 
pair 

$(S, \rho)$ 

where $S = \{s_1, \ldots, s_n\}$ is a set of data objects 
and $\rho = \{R_1, \ldots, R_r\}$ is a set of binary relations 
such that 

$R_i \subseteq S \times S$, $i \in [1:r]$. 

By specifying certain properties of the 
relations in $\rho$, different structure types are 
obtained. These are basically the following four 
types [4]. 


$(S,\rho) = (S,\{\leq\})$ ,   (1) 


where $\leq$ denotes a relation that is reflexive, 
antisymmetric, and transitive, and satisfies the 
additional condition that for any two objects 
$s_1, s_2 \in S$ at least one of the two propositions 
$s_1 \leq s_2$ or $s_2 \leq s_1$ is true. Thus $\leq$ denotes a linear 
ordering. This relation defines an ordered set 
$\{S[1], \ldots, S[n]\}$ of data objects $S[i] \in S$ which are 
identified by an ordinal number specifying their 
relative position in the set. This structure is 
usually called a linear list. 
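The four conditions on the relation in (1) can be checked mechanically for a finite set. The following Python sketch is an illustration only, not part of the paper: the function names `is_linear_ordering` and `as_linear_list` are hypothetical, and the objects are assumed to be distinct and comparable with `==`.

```python
from itertools import product


def is_linear_ordering(objects, related):
    """Check that related(a, b) is reflexive, antisymmetric,
    transitive, and total on the finite collection `objects`."""
    objs = list(objects)
    reflexive = all(related(a, a) for a in objs)
    antisymmetric = all(
        not (related(a, b) and related(b, a)) or a == b
        for a, b in product(objs, repeat=2)
    )
    transitive = all(
        not (related(a, b) and related(b, c)) or related(a, c)
        for a, b, c in product(objs, repeat=3)
    )
    total = all(related(a, b) or related(b, a)
                for a, b in product(objs, repeat=2))
    return reflexive and antisymmetric and transitive and total


def as_linear_list(objects, related):
    """Arrange the objects as S[1], ..., S[n]. Under a linear ordering,
    the ordinal of s equals the number of objects t with t <= s."""
    objs = list(objects)
    return sorted(objs, key=lambda s: sum(related(t, s) for t in objs))


if __name__ == "__main__":
    less_equal = lambda a, b: a <= b
    divides = lambda a, b: b % a == 0
    print(is_linear_ordering({3, 1, 2}, less_equal))
    print(is_linear_ordering({1, 2, 3}, divides))
    print(as_linear_list({3, 1, 2}, less_equal))
```

Divisibility on {1, 2, 3} fails the totality condition (neither 2 divides 3 nor 3 divides 2), so it yields a partial ordering rather than a linear list.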


A simple generalization of a linear list is 
a two-dimensional or higher-dimensional array of 
data objects. In a rectangular two-dimensional 
m×n array we have the linear row lists 
$(R_i, \{\leq_{R,i}\})$, $i \in [1:m]$, with $R_i = \{R_i[1], \ldots, R_i[n]\}$, 
and the linear column lists $(C_j, \{\leq_{C,j}\})$, $j \in [1:n]$, 
with $C_j = \{C_j[1], \ldots, C_j[m]\}$. These linear lists 
are orthogonally connected such that the linear 
ordering of the row lists implies the same ordering 
of the column lists and vice versa. Hence, 
a two-dimensional m×n array is defined by the 
pair (the definition can be easily extended to 
any arbitrary higher dimension) 


$(S,\rho) = (\bigcup_{i=1}^{m} R_i, \{\leq_{R,1}, \ldots, \leq_{R,m}, \leq_{C,1}, \ldots, \leq_{C,n}\})$ . 


$(S,\rho) = (S,\{q_1, \ldots, q_{p-n}\})$ ,   (2) 


where $\rho = \{q_1, \ldots, q_{p-n}\}$ is a set of relations that 
are reflexive, symmetric, and transitive, i.e., 
equivalence relations. The equivalence relation 
$q_1$ defines a partition $P(S) = \{S_1, S_2, \ldots\}$ of the 
set S, and the remaining equivalence relations in 
$\rho$ define refinements of this partition. If we 
assume that P(S) is refined until n singleton 
sets, $\{s_1\}, \ldots, \{s_n\}$, are obtained, each one 
containing exactly one of the elements of S, the 
result is a collection of nested sets [3] 
$C = \{C_1, \ldots, C_p\}$ such that each equivalence relation 
$q_k \in \rho$ defines a partition $P(C_k) \subset C$, $k \in [1:p-n]$, 
whereas the remaining n sets in C are the elements 
of the set $\{\{s_1\}, \ldots, \{s_n\}\} = \{C_{p-n+1}, \ldots, C_p\}$ (any 
partition generates a collection of nested sets, 
but not any collection of nested sets constitutes 
a partition). Such a data structure is called a 
tree. 


The definition of the collection of nested 
sets $C = \{C_1, \ldots, C_p\}$ does not imply an ordering of 
the equivalence classes $C_j \in P(C_k)$, $1 \leq k \leq p-n$, 
but only indicates the ancestor-descendant 
relationship among the equivalence classes $C_k \in C$. 
Trees which are equivalent to a collection of 
nested sets are called oriented trees, since only 
the relative orientation of the nodes is being 
considered. An ordering of all equivalence 
classes $C_j \in P(C_k)$ implies an ordering $\leq$ on the 
data objects $s_i \in S$. Thus, an ordered tree is 
defined by a pair 


$(S,\rho) = (S,\{\leq, q_1, \ldots, q_{p-n}\})$ . 


$(S,\rho) = (S,\{R_1, \ldots, R_r\})$ ,   (3) 


where the relations $R_i \in \rho$ are defined in certain 
pre-defined subsets $A_i, B_i \subseteq S$, i.e., 
$R_i \subseteq A_i \times B_i \subseteq S \times S$, with the additional constraint for the 
range $\mathcal{R}_i$ of a relation $R_i$ that $\mathcal{R}_i = B_i$ (we call 
such a relation range-total). No constraint is 
given for the domain $\mathcal{D}_i$, that is, $\mathcal{D}_i \subseteq A_i$. 
Furthermore, we have 


$S \supseteq A = \bigcup_{1 \leq i \leq r} A_i$ and $S = B = \bigcup_{1 \leq i \leq r} B_i = \bigcup_{1 \leq i \leq r} \mathcal{R}_i$ . 


Adopting Knuth's terminology, we call such a data 
structure a List. 
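The set $\rho$ of relations 
$R_i = \{(s_j, s_k) \mid p_i(s_j, s_k)\} \subseteq \mathcal{D}_i \times \mathcal{R}_i \subseteq A \times S \subseteq S \times S$ 
may be defined by a set $\pi = \{p_1, \ldots, p_r\}$ of propositions. 
For each relation $R_i \in \rho$, we call the elements $a_i \in \mathcal{D}_i$, with 
$\mathcal{D}_i \subseteq A_i \subseteq S$, the reference elements of $R_i$. Then, 
a relation $R_i \subseteq A_i \times \mathcal{R}_i \subseteq A_i \times S$ generates for each 
$a_i \in A_i$ a subset of $\mathcal{R}_i$ that shall be denoted $\mathcal{R}_i/a_i$ 
(read: "the subset of $\mathcal{R}_i$ with respect to $a_i$"), 
such that 

$\mathcal{R}_i/a_i = \{s \in \mathcal{R}_i \mid p_i(a_i, s)\}$ , 
$\mathcal{R}_i/a_i = \emptyset$ if $a_i \notin \mathcal{D}_i$ , and 
$\bigcup_{a_i \in A_i} \mathcal{R}_i/a_i = \mathcal{R}_i$ . 

The relations $R_i \in \rho$ define a set 
$N = \{\mathcal{R}_i/a_i \mid i \in [1:r] \wedge a_i \in A_i\}$ 
such that $\{s_1\}, \ldots, \{s_n\} \in N$. The nodes of 
a List represent the sets $\mathcal{R}_i/a_i \in N$. The definition 
of some ad hoc ordering $\leq$ on the data objects 
$s_i \in S$ implies an ordering of the nodes representing 
the sets $\mathcal{R}_i/a_i \in N$ in all sub-Lists of a 
List L which is defined by a pair 


$(S,\rho) = (S,\{\leq, R_1, \ldots, R_r\})$ . 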


$(S,\rho) = (S,\{R_1, \ldots, R_r\})$ ,   (4) 


where the relations $R_i \in \rho$ are defined in subsets 
$A, B \subseteq S$ such that 


$S \supseteq A = \bigcup_{1 \leq i \leq r} \mathcal{D}_i$ and $S \supseteq B = \bigcup_{1 \leq i \leq r} \mathcal{R}_i$ and $S = A \cup B$ . 


No constraints are imposed on the relations $R_i \in \rho$. 
We call such a data structure an associative 
structure. The elements $a_j \in A \subseteq S$ are called 
domain elements and the elements $b_k \in B \subseteq S$ are 
called range elements. The set $\rho$ of relations 
$R_i = \{(a_j, b_k) \mid p_i(a_j, b_k)\} \subseteq \mathcal{D}_i \times \mathcal{R}_i \subseteq A \times B \subseteq S \times S$ 
is defined by a set $\pi = \{p_1, \ldots, p_r\}$ of propositions. 
Thus, the data structure under consideration 
may be specified by the triad $(A, \pi, B)$ [5]. 
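As a small illustration of the triad (A, π, B), the following Python sketch derives each relation from a proposition over the domain and range elements. It is only a sketch under stated assumptions: the helper names `build_relations` and `range_subset` are hypothetical, propositions are modeled as two-argument predicates, and the example data are invented for the demonstration.

```python
def build_relations(domain, propositions, range_elements):
    """For each named proposition p_i, collect the relation
    R_i = {(a, b) | p_i(a, b)} over domain x range."""
    return {
        name: {(a, b) for a in domain for b in range_elements if p(a, b)}
        for name, p in propositions.items()
    }


def range_subset(relation, a):
    """All range elements that the relation associates with the
    domain (reference) element a; empty if a is not in the domain."""
    return {b for (x, b) in relation if x == a}


if __name__ == "__main__":
    A = {2, 3}                      # domain elements (illustrative)
    B = {4, 6, 9}                   # range elements (illustrative)
    pi = {"divides": lambda a, b: b % a == 0}
    rho = build_relations(A, pi, B)
    print(sorted(rho["divides"]))
    print(sorted(range_subset(rho["divides"], 2)))
```

Because no constraints are imposed on the relations, a range element such as 6 may be associated with several domain elements, which is what distinguishes this structure from a tree.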


3. The Necessity of a Machine Data Model 


In order to store a data structure 
$(S,\rho) = (\{s_1, \ldots, s_n\}, \{R_1, \ldots, R_r\})$ in a computer memory, 
the information content of that data structure, 
i.e., the set $S = \{s_1, \ldots, s_n\}$ of data objects and 
the set of structuring relations $\{R_1, \ldots, R_r\}$, must 
be represented in an appropriate form. That is, 
a memory representation of a data structure must 
retain the set-element relationships defined by 
the relations $R_i \in \rho$. The relations $R_i \subseteq S \times S$ 
define subsets $S_j \subseteq S$ which are represented by 
the nodes of the corresponding data structure 
$(S,\rho)$. If all singletons $\{s_i\}$, $s_i \in S$, are 
uniquely identified by the relations $R_i \in \rho$ in 
connection with reference elements $s_j \in S$, then 
the definition of a linear ordering of the data 
objects $s_i \in S$ implies an ordering of all the 
nodes of the corresponding data structure. Otherwise, 
the nodes of the corresponding data structure 
represent (unordered) sets $S_j \subseteq S$ of data 


objects. Therefore, a data structure architec- 
ture must provide hardware support for the manage- 
ment of ordered sets and general sets, as well as 
an adequate set of operators defined on these 
fundamental types of data entities. 


Physical memory can be either location- 
addressable or content-addressable (associative). 
Hardware-associative memory is ruled out for two 
reasons: Firstly, its cost is prohibitive and, 
secondly, it is not needed, as will be shown sub- 
sequently, if the purpose is to store and access 
structured sets of data rather than unstructured, 
general sets. In the case of location-addressed 
memory, the most fundamental mode of storing the 
data items of a data structure is the consecutive 
storage in the form of a data vector. The alge- 
braic definition of a data structure can here be 
substituted by the "semantic" definition 


<data structure> = 
(<data vector>,<structure specification>) . 
In the von Neumann machine, the mapping from 


a data structure to its data vector is performed 
(by software) in one step. However, such a 
mapping can be greatly facilitated if the data 
structure is, in a first step, mapped onto an 
appropriate "intermediate" data structure which, 
in turn, is then mapped in a second step onto the 
data vector. We call the first mapping a struc- 
ture definition and the second mapping an addres- 
sing function. The advantage of this approach 
lies in the fact that a standardized intermediate 
structure can be found that is necessary and 
sufficient for the representation of all data 
structures defined in section 2, whereas the data 
vector represents only the data of those struc- 
tures. 


Let $\mathbb{N}$ denote the set of non-negative integers 
and let M be the set of memory addresses. A data 
vector that is physically represented by sequential 
memory locations is defined by the mapping 


$v: \mathbb{N} \to M$ . 


Let B be a set of r-dimensionally ordered sets, 
i.e., an element of B is defined by 


$\sigma: \mathbb{N}^r \to \mathbb{N}$ . 


We call B the basis of the data structure architecture. 
The positions in the r-tuples $(n_1, \ldots, n_r) \in \mathbb{N}^r$ 
are called the coordinates of the r-dimensionally 
ordered set, and r is called its 
rank (dimensionality). An element of $\mathbb{N}^r$ is 
called an index r-tuple. The index r-tuples are 
unique identifiers of the elements of an r-dimensionally 
ordered set, as the function $\sigma$ maps 
index r-tuples into indices which specify the 
relative position of the identified element in 
the data vector. Hence, the mapping from the 
thus defined basis into a physical data vector, 
based on a sequential memory allocation, is 
accomplished by a composition of the functions $\sigma$ 
and $v$ into a function 


$\alpha: \mathbb{N}^r \to M$ 


which we call the addressing function. 


Sequential allocation is characterized by 
the linear ordering of the memory locations. The 
addressing function for sequentially allocated 
r-dimensionally ordered sets is 


$\alpha(n_1, \ldots, n_r) = \beta + (\sigma(n_1, \ldots, n_r) - 1) \cdot m$ , 


where $\beta \in M$ is the base address, and m is the number 
of memory words occupied by each data item. 
The limitation of the basis to multi-dimensionally 
ordered sets thus allows the use of a rigorously 
standardized addressing function -- an absolute 
must if the addressing function is to be executed 
by hardware. Hence, we consider a class of com- 
puter architectures where we have multi-dimension- 
ally ordered sets as the standardized internal 
data structure, called the basis and handled by 
the hardware of the machine. Fig. 1 presents a 
general diagram of such a data structure architec- 
ture. 
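The composition of σ and v into the addressing function α can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the paper's hardware design: the text does not fix a particular σ, so a row-major ordering (last coordinate varying fastest) is assumed here, index values are taken to start at 1, and the names `sigma`, `alpha`, `dims`, and `words_per_item` are hypothetical.

```python
def sigma(index, dims):
    """Map a 1-based index r-tuple to its 1-based position in the
    data vector. Row-major order is an assumption of this sketch."""
    if len(index) != len(dims):
        raise ValueError("index and dims must have the same rank")
    position = 0
    for n, d in zip(index, dims):
        if not 1 <= n <= d:
            raise IndexError(f"index value {n} outside 1..{d}")
        position = position * d + (n - 1)
    return position + 1


def alpha(index, dims, base, words_per_item=1):
    """Addressing function: base + (sigma(index) - 1) * m."""
    return base + (sigma(index, dims) - 1) * words_per_item


if __name__ == "__main__":
    # A 3 x 4 ordered set stored from address 100, two words per item.
    print(sigma((2, 3), (3, 4)))
    print(alpha((2, 3), (3, 4), base=100, words_per_item=2))
```

Only the dimensions, the base address, and the item size enter the computation, which is what makes the function a candidate for a fixed hardware realization.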


DATA STRUCTURES --(structure definition)--> BASIS --(addressing function)--> DATA VECTOR 
                |------- software -------|         |------- hardware -------| 

Fig. 1 General Concept of Data Structure 
Architectures 


Of course, a data structure architecture 
shall process at the hardware level not only 
multi-dimensionally ordered sets but any of the 
structure types as defined in section 2. To this 
end, other data structures must be mapped through 
an appropriate structure definition on multi- 
dimensionally ordered sets, i.e., on the basis of 
the data structure architecture. Therefore, a 
mechanism for structure definitions must be devel- 
oped, and it must be proved that all types of data 
structures can be defined in such a way. These 
stipulations can be satisfied by introducing an 
appropriate machine data model. A machine data 
model defines legitimate data types and struc- 
turing relations which are applicable for the 
definition of arbitrary data structures in terms 
of the basis. In order to mitigate the restric- 
tion that only rigorously standardized physical 
structures can be used for a hardware realization, 
a machine data model must be more general than the 
conceptual data models which were developed for 
generalized data-base management [6], [7]. 


4. The Linear Data Model 


As a basis for the design of data structure 
architectures, we define the linear data model. 
DEFINITION: The linear data model is based on 
the linear ordering as the only structuring rela- 
tion. Identifiers of basis elements are a data 
type of the linear data model. 

Unlike a pointer, an IDENTIFIER does not represent 
a reference to the identified basis element but 
the basis element itself [8]. The data items of 
a basis element are stored in a data vector. 
Hence, the linear data model defines basis ele- 
ments as ordered sets of data vectors. Conse- 
quently, multi-dimensionally ordered sets are the 
only basis structures permitted by the linear data 
model. 


Let A be an r-dimensionally ordered set whose 
components are denoted $A[n_1; \ldots; n_r]$. With the 
definition of the admissible ranges of the index 
values $n_i$, $i \in [1:r]$, in all index lists 
$[n_1; \ldots; n_r] \in \mathbb{N}^r$ such that $n_1 \in [1:d_1]$ and 
$n_j \in [1:d_j[n_1; \ldots; n_{j-1}]]$, $j \in [2:r]$, the linear lists 

$(A[n_1; \ldots; n_{i-1}; 1; n_{i+1}; \ldots; n_r], \ldots, A[n_1; \ldots; n_{i-1}; d_i[n_1; \ldots; n_{i-1}]; n_{i+1}; \ldots; n_r])$, $i \in [1:r]$, 

define cross sections of A which are denoted 
$A[n_1; \ldots; n_{i-1}; ; n_{i+1}; \ldots; n_r]$. We call $d_i[n_1; \ldots; n_{i-1}]$ the 
dimension of the linear list $A[n_1; \ldots; n_{i-1}; ; n_{i+1}; \ldots; n_r]$. 
r 


The introduction of the linear data model as 
the fundamental notion for the design of data 
structure architectures is based on the 
THEOREM: The linear data model is necessary and 
sufficient for the definition of linear lists, 
arrays, trees, generalized lists, and associative 
structures in terms of multi-dimensionally ordered 
sets of linear lists. 

Proof. 

(1) Necessity: Linear orderings constitute 
the simplest possible structuring relations 
with respect to the representation of data struc- 
tures in location-addressed memories. Linear 
lists are the fundamental basis elements, as they 
are identical with the structure of the under- 
lying data vectors. 


(2) Sufficiency: An r-dimensionally ordered set 
A can be represented by an (i-1)-dimensionally 
ordered set whose components are the (r-i+1)- 
dimensionally ordered sets $A[n_1; \ldots; n_{i-1}]$, 
$n_j \in [1 : d_j[n_1; \ldots; n_{j-1}]]$, $j \in [1 : i-1]$. In the notation 
$A[n_1; \ldots; n_{i-1}][n_i; \ldots; n_r]$, the second index 
list $[n_i; \ldots; n_r]$ specifies the components in the 
(r-i+1)-dimensionally ordered set $A[n_1; \ldots; n_{i-1}]$ 
as defined by the index list $[n_1; \ldots; n_{i-1}]$. Thus, 
the components $A[n_1; \ldots; n_{i-1}][n_i; \ldots; n_r]$ represent 
the components $A[n_1; \ldots; n_r]$ of the r-dimensionally 
ordered set A. By forming cross sections of the 
components $A[n_1; \ldots; n_{i-1}][n_i; \ldots; n_r]$, an r-dimensionally 
ordered set A can be defined as an (i-1)- 
dimensionally ordered set of linear lists 


$A[n_1; \ldots; n_{i-1}] = (A[n_1; \ldots; n_{i-1}][1], \ldots, A[n_1; \ldots; n_{i-1}][d_i[n_1; \ldots; n_{i-1}]])$ , 


such that the second index list [k], 
$k \in [1 : d_i[n_1; \ldots; n_{i-1}]]$, specifies the (r-i)-dimensionally 
ordered sets $A[n_1; \ldots; n_{i-1}; k]$ which are the components 
of the linear lists $A[n_1; \ldots; n_{i-1}]$. Obviously, 
for i=r, the above derivation defines 
an r-dimensionally ordered set A as an (r-1)- 
dimensionally ordered set of linear lists. It is 
readily recognized that the recursive application 
of the above definition leads to representations 
of r-dimensionally ordered sets as orthogonal 
interconnections of linear lists. Moreover, the 
definition of the data type IDENTIFIER allows the 
representation of any set containment in an r- 
dimensionally ordered set in the form of linear 
lists $A[n_1; \ldots; n_{i-1}]$ whose components 
$A[n_1; \ldots; n_{i-1}][k]$ may represent arbitrary identifiers 
$A[m_1; \ldots; m_j]$ which, in turn, represent (r-j)-dimensionally 
ordered sets $A[m_1; \ldots; m_j]$. Obviously, 
with the above definition of the data 
type IDENTIFIER, the r coordinates of an r-dimensionally 
ordered set correspond to r levels of 
substructure containment. Therefore, a linear 
list $A[n_1; \ldots; n_{i-1}]$ of data type IDENTIFIER may 
represent a node at the (i-1)st level of a hierarchical 
structure and is thus a "parent" of components 
$A[n_1; \ldots; n_{i-1}][k]$ which may represent nodes 
$A[m_1; \ldots; m_j]$ at any level of the hierarchical 
structure. In addition to the predecessor- 
successor relationships defined by linear orderings, 
the introduction of the data type IDENTIFIER 
hence allows the definition of arbitrary parent- 
child relationships. 


A tree structure can be defined by the specification 
of an r-dimensionally ordered set, such 
that the components $A[n_1; \ldots; n_{i-1}][k]$ of the 
linear lists $A[n_1; \ldots; n_{i-1}]$ of data type IDENTIFIER 
exclusively represent the (r-i)-dimensionally 
ordered sets $A[n_1; \ldots; n_{i-1}; k]$. As the components 
$A[n_1; \ldots; n_{i-1}][k]$ of linear lists $A[n_1; \ldots; n_{i-1}]$ 
may represent identifiers of arbitrary (r-j)- 
dimensionally ordered sets $A[m_1; \ldots; m_j]$, it is 
obvious that generalized lists can be defined by 
the linear data model. 


The linear ordering of the memory locations 
in a location-addressed memory implies an order- 
ing of the elements of general sets in memory 
representations. Thus, linear lists are adequate 
logical structures for the representation of 
general sets in location addressed memories. Pos- 
sible nestings of general sets are also easily 
manageable through linear lists of data type 
IDENTIFIER. The latter property of the linear 
data model, and the ability to arbitrarily link 
multi-dimensionally ordered sets through their 
identifiers, may efficiently be applied for the 
definition of associative structures (q.e.d.). 


The above discussion of the efficiency of 
the linear data model shows that a hardware- 
associative memory would not facilitate the storage 
of basis elements, for components of multi-dimensionally 
ordered sets are uniquely identified 
through their index lists $[n_1; \ldots; n_r] \in \mathbb{N}^r$. 
Therefore, the multi-match capabilities of an 
associative memory cannot be exploited. 


5. The Internal Information Structure 


5.1 A Proposed Standardization of the Basis Elements 


So far, we assumed basis elements to be r- 
dimensionally ordered sets. In section 4 it is 
proved that one-dimensionally ordered sets 
(linear lists) are sufficient for the representa- 
tion of arbitrary data structures. However, we 
propose two-dimensionally ordered sets, given in 
the form of homogeneous, rectangular arrays 
(matrices) as the standardized basis element. 
Such a structure has the following desirable 
properties: 

(i) The dimensions of the linear lists in each of 
the two coordinates of a matrix are the same, 
i.e., a basis element is fully specified 
by a dimension vector $D = (d_1, d_2)$, where 
$d_1$ and $d_2$ are the column dimension and 
the row dimension, respectively. 
(ii) The addressing function is given by the 
simple expression 
$\alpha(n_1, n_2) = (n_1 - 1) \cdot d_2 + n_2 - 1 + \beta$ . 
(iii) Matrices are the most important data structure 
in practical applications. The existence 
of several structuring relations 
within a linear list (as usually represented 
by multi-linked structures) can be 
represented by a single basis element of 
data type IDENTIFIER. 


5.2 Memory Representation of the Basis 


The standardized basis elements are repre- 
sented by variable descriptors which contain the 
parameters of the addressing function $\alpha: \mathbb{N}^2 \to M$. 
The general format of these variable descriptors 
is defined by the triple 


VD = (a,s,b) , 


with a = variable attributes (including data type 
specification), s = structure specification, and 
b = base address of the data vector. Hence, the 
data definition of two-dimensional arrays is 
obtained in the form of standardized variable 
descriptors 


VD = (<attributes>, (<column dimension>, 
<row dimension>),<base address>). 


Variable descriptors of this format can have a 
uniform length of one memory word. Thus, data 
definitions can be stored as named variable des- 
criptors, such that descriptor identifiers are 
equated with the memory locations which contain 
the associated variable descriptors. It is 
readily recognized that identifiers of basis 
elements as defined by the linear data model cor- 
respond with descriptor identifiers. In contrast 
to the von Neumann machine, where a machine var- 
iable is defined by a pair <variable> = (<loca- 
tion>,<value>), we have the following machine 
variable structure 


<variable> = (<name>,<value>) 
<name> = <descriptor identifier> 
<value> = (<data vector>,<structure 


specification>). 


This machine variable structure implies a 
two-stage value reference scheme through variable 
descriptors. The components of the data vector 
are accessed by executing the addressing function 
a for the structure specification given in the 
variable descriptor. This value reference scheme 
also applies to multi-dimensionally ordered sets 
which are represented by two-dimensional arrays 
of data type IDENTIFIER. According to the 
definition of the data type IDENTIFIER, references 
to components of data vectors of data type IDEN- 
TIFIER are automatically replaced by references 
to the identified variable descriptors. This 
indirect reference scheme can be nested to any 
arbitrary depth, resulting in an iterative appli- 
cation of the standardized two-stage value refer- 
ence mechanism. 
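The two-stage value reference scheme and its iteration through components of data type IDENTIFIER can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the machine's actual format: descriptor identifiers are modeled as dictionary keys rather than memory locations, the attribute field is reduced to a data type string, the addressing function follows the row-major form assumed for property (ii), and all names (`VariableDescriptor`, `address`, `fetch`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class VariableDescriptor:
    data_type: str          # e.g. "IDENTIFIER" or "INTEGER" (attributes)
    dims: Tuple[int, int]   # (column dimension d1, row dimension d2)
    base: int               # base address of the data vector


def address(vd, n1, n2):
    """Addressing function for a two-dimensional basis element."""
    d1, d2 = vd.dims
    if not (1 <= n1 <= d1 and 1 <= n2 <= d2):
        raise IndexError("index pair outside the dimension vector")
    return (n1 - 1) * d2 + n2 - 1 + vd.base


def fetch(descriptors, data, name, *index_pairs):
    """Resolve A[m1;m2][m3;m4]... by iterating the two-stage reference:
    descriptor -> address -> component (possibly another identifier)."""
    if not index_pairs:
        raise ValueError("at least one index pair is required")
    vd = descriptors[name]
    for level, (n1, n2) in enumerate(index_pairs, start=1):
        value = data[address(vd, n1, n2)]
        if level == len(index_pairs):
            return value
        if vd.data_type != "IDENTIFIER":
            raise TypeError("only IDENTIFIER components can be indexed")
        vd = descriptors[value]     # replace identifier by its descriptor


if __name__ == "__main__":
    data = ["X", "Y", 10, 11, 12, 13, 20, 21, 22]
    descriptors = {
        "A": VariableDescriptor("IDENTIFIER", (1, 2), 0),
        "X": VariableDescriptor("INTEGER", (2, 2), 2),
        "Y": VariableDescriptor("INTEGER", (1, 3), 6),
    }
    print(fetch(descriptors, data, "A", (1, 1), (2, 1)))
    print(fetch(descriptors, data, "A", (1, 2), (1, 3)))
```

In the example, the two components of A identify arrays of different shapes, so the same mechanism also covers irregular structures.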


With the equivalence of coordinates of multi- 
dimensionally ordered sets and the levels of substructure 
containment (cf. section 4), we obtain 
a correspondence of n-1 nested references of descriptor 
identifiers in two-dimensional arrays 
with a (2n)-dimensionally ordered set. Let A be 
a two-dimensional array of data type IDENTIFIER 
which represents a (2n)-dimensionally ordered 
set with components $A[m_1;m_2][m_3;m_4]\ldots[m_{2n-1};m_{2n}]$. 
The components of this (2n)-dimensionally ordered 
set are the components of all two-dimensional 
arrays $A[m_1;m_2]\ldots[m_{2n-3};m_{2n-2}]$ which are accessed 
through n-1 levels of descriptor references. The 
descriptor references are defined by the descriptor 
identifiers $A[m_1;m_2]\ldots[m_{2i-1};m_{2i}]$ of the 
two-dimensional arrays $A[m_1;m_2]\ldots[m_{2i-3};m_{2i-2}]$ 
of data type IDENTIFIER, $i \in [1:n-1]$ (for i=1, 
$A[m_{-1};m_0] = A$). That is, the components 
$A[m_1;m_2][m_3;m_4]\ldots[m_{2n-1};m_{2n}]$ are accessed through n 
iterative executions of the addressing function 


$\alpha(m_{2j-1}, m_{2j}) = (m_{2j-1} - 1) \cdot d_{2,A[m_1;m_2]\ldots[m_{2j-3};m_{2j-2}]} + m_{2j} - 1 + \beta_{A[m_1;m_2]\ldots[m_{2j-3};m_{2j-2}]}$ , 


$j \in [1:n]$. The base addresses 
$\beta_{A[m_1;m_2]\ldots[m_{2j-3};m_{2j-2}]}$ and the dimension vectors 
$D_{A[m_1;m_2]\ldots[m_{2j-3};m_{2j-2}]} = (d_{1,A[m_1;m_2]\ldots[m_{2j-3};m_{2j-2}]}, d_{2,A[m_1;m_2]\ldots[m_{2j-3};m_{2j-2}]})$ 
are specified in the 
variable descriptors of the two-dimensional arrays 
$A[m_1;m_2]\ldots[m_{2j-3};m_{2j-2}]$. Each two-dimensional 
array $A[m_1;m_2]\ldots[m_{2i-3};m_{2i-2}]$ defines the ordering 
of the two-dimensional arrays $A[m_1;m_2]\ldots[m_{2i-1};m_{2i}]$, 
$i \in [1:n-1]$, within the coordinates 
with indices 2i-1 and 2i. Hence, in accordance 
with the definition of the data type IDENTIFIER, 
the machine variable A completely defines the 
ordering of all components $A[m_1;m_2]\ldots[m_{2n-1};m_{2n}]$ 
within all 2n coordinates. 


We call the above memory allocation scheme 
for basis elements a descriptor referenced allo- 
cation. With the specification of all necessary 
variable attributes and of the dimension vectors 
in the variable descriptors of the two-dimensional 
arrays $A[m_1;m_2]\ldots[m_{2i-1};m_{2i}]$, variable definitions 
are completely self-descriptive. That is, 


descriptor referenced allocation allows for modu- 
lar variable definitions through variables of data 
type IDENTIFIER. The descriptor identifiers bind 
the variable definitions of basis elements, and 
hence, the descriptions of multi-dimensionally 
ordered sets. The automatic replacement of descriptor 
identifiers by the referenced descriptors 
builds up a complete structure specification by 
selecting the appropriate parameters for the iterative 
execution of the addressing function $\alpha$. 
Hence, the addressing function for two-dimensional 
arrays is the only tie between multi-dimensionally 
ordered sets of basis elements representing arbi- 
trary data structures and components of these data 
structures. 


The descriptor reference mechanism for vari- 
ables of data type IDENTIFIER does not prescribe 
a uniform data type for all components. Rather, 
the data type of the components is described by 
their variable descriptors. Hence, heterogeneous 
data structures can be defined. Furthermore, the 
unique definition of the coordinate dimensions of 
multi-dimensionally ordered sets by the variable 
descriptors at the different reference levels 
allows the construction of irregular data struc- 
tures. As shown in [9], the descriptor referenced 
allocation scheme can also be exploited to arbi- 
trarily restructure multi-dimensionally ordered 
sets without modifying or copying the underlying 
data vectors. To this end, the addressing function 
is extended into a generalized storage access 
function. In addition to that, the dimension 
vector in the variable descriptor of a restruc- 
tured variable is replaced by a description of 
appropriate structure functions. Restructured 
variables reference the variable descriptor of the 
variables from which they were generated through 
restructuring. With the automatic replacement of 
descriptor identifiers, the generalized storage 
access function then maps a component S[i;j] of 
a two-dimensional array S onto the data vector of 
a two-dimensional array A, from which S was gener- 
ated through restructuring. That is, the execu- 
tion of the generalized storage access function 
comprises the execution of the addressing function 
$\alpha$ and the execution of the stored structure functions. 
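Restructuring without copying can be illustrated with a transposition. The Python sketch below is only an assumption-laden illustration of the idea, since the generalized storage access function itself is specified in [9] and not reproduced here: a restructured descriptor holds a reference to the descriptor it was generated from plus a stored structure function, and the access function applies the structure function before the addressing function. The class `Descriptor`, the functions `storage_access` and `transpose`, and the decision to keep a dimension pair in the restructured descriptor for bounds checking are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Descriptor:
    dims: Tuple[int, int]                     # (d1, d2)
    base: int = 0                             # used by original variables
    parent: Optional["Descriptor"] = None     # set for restructured ones
    structure_fn: Optional[Callable[[int, int], Tuple[int, int]]] = None


def storage_access(vd, i, j):
    """Generalized storage access: apply stored structure functions,
    then the addressing function of the original variable."""
    d1, d2 = vd.dims
    if not (1 <= i <= d1 and 1 <= j <= d2):
        raise IndexError("index pair outside the dimension vector")
    if vd.parent is not None:
        pi, pj = vd.structure_fn(i, j)
        return storage_access(vd.parent, pi, pj)
    return (i - 1) * d2 + j - 1 + vd.base


def transpose(vd):
    """Create a restructured descriptor S with S[i;j] = A[j;i];
    no data vector is modified or copied."""
    d1, d2 = vd.dims
    return Descriptor(dims=(d2, d1), parent=vd,
                      structure_fn=lambda i, j: (j, i))


if __name__ == "__main__":
    data = [100, 101, 102, 103, 104, 105]     # data vector of A (2 x 3)
    A = Descriptor(dims=(2, 3), base=0)
    S = transpose(A)
    print(data[storage_access(S, 3, 1)])      # same cell as A[1;3]
```

Since restructured descriptors may reference other restructured descriptors, structure functions compose: transposing S again addresses the same cells as A.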


The descriptor referenced allocation of data 
structures is a refinement of the concept of self- 
identifying information components in tagged arch- 
itectures [10], [11]. In tagged architectures, 
self-identification provides the possibility to 
uniquely associate with each category of variable 
specifications dedicated control routines. Con- 
trastingly, descriptor referenced allocation 
defines self-descriptive data entities through a 
standardized basis which can be managed by a 
standard set of control routines. Basis elements 
are self-identifying. However, there is no need 
for a self-identification of different types of 
data structures, as they are uniformly constructed 
from self-descriptive components. The invocation 
of the appropriate standard control routines is 
completely described by the variable attributes in 
the self-identifying basis elements and the 
ordering of descriptor identifiers in variables 
of data type IDENTIFIER. 


The modularity of the internal information 
structure, as implied by the above implementation 
of the linear data model, suggests the separate 


storage of three basic information components 
[12], [13]. These are 


-- an instruction list IL, 
-- a variable descriptor list VDL, and 
-- a data list DL. 


Hence, the internal information structure is 
defined by the triple 


(IL, VDL, DL) 


5.3 Machine Language Instructions 


Machine language instructions exclusively 
reference variable descriptors, i.e., we have the 
general instruction format $(\psi, VD_3, VD_1, VD_2)$. $\psi$ is 
the operation code and $VD_3$ is the descriptor 
identifier of the result variable, whereas $VD_1$ and 
$VD_2$ are the descriptor identifiers of the operand 
variables. Hence, a data structure 
architecture processes basis elements in response 
to single machine instructions. Three-address 
instructions are a prerequisite for the processing 
of ordered sets in a streaming mode. The machine 
language instructions may be grouped into the 
following categories [9] 


-- Scalar Operations 
-- Reductions 
-- Inner Products 
-- Structuring Operations 
-- Transfer Operations 
-- Queries 
-- Jumps 
-- Declarations and I/O Operations 


The first three groups are value-transfor- 
ming operations which generate a new variable 
descriptor and a new data vector for the result 
variable. Structuring operations create a new 
structuring of existing data, i.e., they are 
solely performed on descriptors, not on data. 
Transfer operations primarily perform parameter 
transfers "by reference" and "by value" to and 
from subroutines. Queries apply to the basic 
components of the internal information structure, 
i.e., to variable descriptors and data vectors. 
Jumps constitute the program flow control opera- 
tions. 
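The difference between a value-transforming operation and a structuring operation can be made concrete with a toy interpreter. The Python sketch below is a simplified illustration under stated assumptions, not the instruction set of [9]: the opcodes "ADD" and "MUL", the `Machine` class, and the `reshape` operation are hypothetical, descriptors are reduced to a pair (dimension vector, base address), and the data list is modeled as a Python list.

```python
import operator

# Hypothetical opcodes for element-wise scalar operations.
SCALAR_OPS = {"ADD": operator.add, "MUL": operator.mul}


class Machine:
    def __init__(self):
        self.data = []          # data list DL
        self.descriptors = {}   # descriptor list VDL: name -> (dims, base)

    def declare(self, name, dims, values):
        """Allocate a new data vector and create its descriptor."""
        d1, d2 = dims
        if len(values) != d1 * d2:
            raise ValueError("values do not match the dimension vector")
        base = len(self.data)
        self.data.extend(values)
        self.descriptors[name] = (dims, base)

    def vector(self, name):
        (d1, d2), base = self.descriptors[name]
        return self.data[base:base + d1 * d2]

    def execute(self, opcode, result, left, right):
        """Three-address, value-transforming instruction: one instruction
        processes whole basis elements and creates the result variable."""
        fn = SCALAR_OPS[opcode]
        dims_left, _ = self.descriptors[left]
        dims_right, _ = self.descriptors[right]
        if dims_left != dims_right:
            raise ValueError("operands must have equal dimension vectors")
        values = [fn(a, b)
                  for a, b in zip(self.vector(left), self.vector(right))]
        self.declare(result, dims_left, values)

    def reshape(self, result, source, dims):
        """Structuring operation: only a descriptor is created; the
        result shares the data vector of the source variable."""
        (d1, d2), base = self.descriptors[source]
        if dims[0] * dims[1] != d1 * d2:
            raise ValueError("new dimensions must cover the same items")
        self.descriptors[result] = (dims, base)


if __name__ == "__main__":
    m = Machine()
    m.declare("X", (2, 2), [1, 2, 3, 4])
    m.declare("Y", (2, 2), [10, 20, 30, 40])
    m.execute("ADD", "Z", "X", "Y")     # result variable declared at run time
    m.reshape("R", "Z", (1, 4))         # no data is moved
    print(m.vector("Z"), m.descriptors["R"])
```

Note that the result variable Z needs no prior declaration: its descriptor is derived from the operand descriptors when the instruction executes.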


The self-descriptiveness of stored data 
structures allows the creation of complete vari- 
able descriptors as part of the execution of 
assignment statements. Hence, variables are 
dynamically declared at run time. Consequently, 
the machine language is to a large extent declaration-free, 
except for input operations. Variables 
which are created by input operations must be 
declared as to their data type and coordinate 
dimensions. 


Normally, a sequential storage of data is 
inefficient if such data vectors are to be 
manipulated dynamically. In data structure 
architectures, this problem is circumvented by the 
capability to manipulate variable descriptors 
through structuring operations. Furthermore, with 
descriptor referenced allocation, unnecessary copies 
of data vectors can be avoided by the definition 
of different basis elements on the same underlying 
data vector. The mechanization of the conversion 
of basis elements into data vectors achieves 
physical data independence. Hence, the reference 
of self-descriptive variables in machine language 
instructions is not affected by the representation 
of data objects in the data vectors. A high degree 
of logical data independence is achieved by the 
fact that changes of data definitions through the 
creation of variables of data type IDENTIFIER do 
not affect other existing data definitions. 
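
The separation between descriptors and data vectors may be easier to 
see in a small sketch. The Python below is only an illustration under 
stated assumptions, not the descriptor format of the machine: the 
`Descriptor` class, its `offset`, `stride` and `length` fields, and the 
`every_other` and `add` operations are hypothetical names introduced 
here. It shows a structuring operation that touches only a descriptor, 
and a value-transforming operation that produces a new descriptor and 
a new data vector. 

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Descriptor:
    """Hypothetical variable descriptor: a view onto a shared data vector."""
    data: List[float]   # underlying data vector (shared, never copied)
    offset: int         # index of the first element
    stride: int         # distance between successive elements
    length: int         # number of elements in the basis element

    def elements(self) -> List[float]:
        """Convert the basis element into an explicit list of values."""
        return [self.data[self.offset + i * self.stride]
                for i in range(self.length)]


def every_other(d: Descriptor) -> Descriptor:
    """Structuring operation: builds a new descriptor, touches no data."""
    return Descriptor(d.data, d.offset, d.stride * 2, (d.length + 1) // 2)


def add(a: Descriptor, b: Descriptor) -> Descriptor:
    """Value-transforming operation: new descriptor and new data vector."""
    assert a.length == b.length
    result = [x + y for x, y in zip(a.elements(), b.elements())]
    return Descriptor(result, 0, 1, len(result))


dl = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]
whole = Descriptor(dl, 0, 1, 6)
evens = every_other(whole)   # shares dl with `whole`, no copy is made
total = add(evens, evens)    # allocates a fresh data vector
```

Here `whole` and `evens` are two basis elements defined on the same 
data vector, which is the situation the paragraph above describes. 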


Conclusion 


Attempts have been made before to provide 
hardware support for the generation of data struc- 
tures. One such example is the SYMBOL machine 
[14]. However, while the SYMBOL concept provides 
a mechanism for building structures, it offers no 
means for processing them. Ultimately, we may 
speak of a certain data structure of a machine 
only if the machine comprises operators to perform 
transformations on the structure. Other authors 
[15,16] have recognized the necessity for data 
structure architectures but do not present a 
general solution. 


The concept of data structure architectures, 
as introduced in the paper, represents a novel 
approach that is radically different from most 
endeavors as yet so typical in computer architec- 
ture. The typical approach has been to multiply 
certain hardware resources (e.g., processors, 
memories, etc.) and arrange these modules into 
organizational structures which reflect certain 
task patterns. Contrastingly, our approach is to 
start from a general requirement of computing, the 
ability to create and process data structures, and 
develop a standardized logical model. It is shown 
in the paper that this is generally feasible, and 
the resulting information structure is described. 
Its modularity implies a high degree of orthogon- 
alization of the hardware, thus lending itself in 
a natural way toward parallel processing. In our 
opinion, the concept of data structure architec- 
tures presents a genuine alternative to the von 
Neumann concept in the realm of general-purpose 
computing. 


References 


[1] A. W. Burks, H. H. Goldstine, J. von Neumann, 
Preliminary Discussion of the Logical Design 
of an Electronic Computing Instrument, (Part 
I, vol. 1), report for the U.S. Army Ordnance 
Department, 1946, in A. H. Taub (ed.), 
Collected Works of John von Neumann, Vol. 5, 
The MacMillan Company, New York, (1963), 
pp. 34-79. 

[2] P. Wegner, Programming Languages, Information 
Structures, and Machine Organization, 
McGraw-Hill, London, (1971). 

[3] D. E. Knuth, The Art of Computer Programming, 
Vol. 1, Chapter 2, Addison Wesley, 
Second Edition, (1975). 

[4] W. K. Giloi and H. Berg, A Uniform, Algebraic 
Description of Data Structures, 
Computer Science Dept., Univ. of Minnesota, 
Tech. Report 76-15. 

[5] J. A. Feldman and P. D. Rovner, "An ALGOL-Based 
Associative Language," CACM 12,8 
(August, 1969), 439-449. 

[6] CODASYL Data Base Task Group, April 1971 
Report, ACM, New York, (1971). 

[7] D. S. Tsichritzis and F. H. Lochovsky, 
"Hierarchical Data Base Management," 
Computing Surveys, Vol. 8, No. 1, 
(March, 1976), pp. 105-123. 

[8] W. K. Giloi, "BEYOND APL - An Interactive 
Language for the Eighties," Proc. ICS 77, 
North Holland Publ., (1977). 

[9] H. Berg, A Computer Architecture Based on 
Ordered Sets as Primitive Data Entities, 
Ph.D. Thesis, Computer Science Dept., Univ. 
of Minnesota, (1977). 

[10] J. K. Iliffe, Basic Machine Principles, 
American Elsevier Publishing Co., New York, 
(1968). 

[11] E. A. Feustel, "On the Advantages of Tagged 
Architecture," IEEE Trans. on Computers, 
Vol. C-22, No. 7, (July, 1973), pp. 644-656. 

[12] W. K. Giloi and H. Berg, "STARLET - A Computer 
Based on Ordered Sets as Primitive 
Data Types," Proc. 2nd Annual Symposium 
on Computer Architecture, Houston 1975, 
pp. 201-206. 

[13] W. K. Giloi and H. Berg, "STARLET - A 
Contribution to the Computer Architecture 
of the Post von Neumann Era," Computer 
Science Dept., Univ. of Minnesota, 
Tech. Report 75-21. 

[14] W. R. Smith, et al., "SYMBOL - A Large 
Experimental System Exploring Major 
Hardware Replacement of Software," Proc. 
AFIPS SJCC 1971, 601-616. 

[15] Y. Chu, "Architecture of Hardware 
Interpreter," Proc. 4th Annual Symposium on 
Computer Architecture, IEEE Computer 
Society Catalog No. 77CH1182-5C, 1-9. 

[16] K. J. Thurber and P. C. Patton, Data 
Structures and Computer Architecture, 
Lexington Books, Lexington, Massachusetts, 
(1977). 


A MULTI-MINICOMPUTER APPROACH TO CONCURRENT 
COMPUTATION FOR INTERACTIVE ON-LINE 
SIMULATION OF COMPLEX BIOSYSTEMS (2) 

It is the main purpose of this paper to describe 
our experience in designing and implementing an 
all-digital simulation system with the problems 
partitioned to run on a tightly-coupled complex of 
arithmetic processors. These arithmetic processor 
modules are, in fact, modern high-speed 
minicomputers. More particularly, we describe a new 
modular computing resource currently being 
developed specifically to meet the needs of the 
biomedical modeler. A computer system well suited 
to the needs of this environment may be equally 
appropriate for use in the simulation of other 
complex systems, and the approach taken in 
designing a simulation resource for biomedicine is 
described for the general interest of the computer 
science and engineering community. 


The presentation follows in two principal 
parts. The first part is a discussion of the 
rationale for the development of a new 
multicomputer simulation system with a 
consideration of alternative approaches and 
associated trade-offs. This is then followed by 
a description of the overall system architecture 
and of the hardware and software that have been 
assembled and integrated into the now operational 
MMCS (MultiMiniComputer System). In addition, it 
seems appropriate to consider some of the 
factors, economic and technological, that make 
such a system especially attractive at this time. 


It seemed clear to us that the machines 
typically used to support common modeling 
languages were less than ideal for this modeling 
task and that a multicomputer system could be 
devised that would be a much better match to the 
requirements of the modeling process. In 
particular, in order to provide the compute power 
needed to work with complex models, it seemed 
highly reasonable to provide parallel computing 
to better match the parallel nature of the 
systems being simulated. The system, as 
initially conceived, would be made up of a number 
of modern high-speed minicomputers operating 
concurrently. It was anticipated that such a 
multicomputer system could retain many desirable 
features and capabilities typically found in 
other modeling or simulation systems at a much 
improved cost-effectiveness level while providing 
a number of other significant advantages. 


(4) This work was supported in part by a 
Biotechnology Resource grant RR 00276 from the 
Division of Research Resources of the National 
Institutes of Health. 


The principal hardware components making up 
the currently operating MMCS are as follows: 

1 Mapped Eclipse S/200 with 192K bytes of 
  memory and hardware floating point 
3 Eclipse S/200 with 64K bytes of memory 
  and hardware floating point 
1 Floating-point array processor (AP120B) 
1 Mapped Nova 3/12 with 128K bytes of 
  memory 
2 80 megabyte disk drive with controller 
5 MCA (Multi-Communications Adaptor, 
  allows memory to memory data transfer 
  for all machines) 

The development of system software for a 
multicomputer system can be an enormous task 
involving many man-years of effort. Our initial 
approach was to use Data General Corporation's 
ARDOS operating system. To run programs on the 
satellite computers, the load (link-edit) 
process for ARDOS was modified so that a small 
pseudo-operating-system (approximately 400 bytes) 
is inserted into each load module (core image 
file). This change has the far-reaching effect 
of allowing a load module produced by any of the 
language processors to be executed on any of the 
computers whether or not an operating system is 
present. 


The user software available to accomplish 
concurrency consists of a few primitives which 
may be called as subroutines from the various 
language processors. The primitives allow such 
functions as sending to or receiving from any 
other processor, reading or writing common 
memory, and testing or setting common flags. All 
concurrency is controlled directly by the 
programmer. 


The great generality of our hardware and 
software configuration allows not only the 
traditional forms of concurrent processing but 
promotes the use of pipelining techniques as 
well. Our experience thus far shows pipelining 
to be a much more widely applicable and useful 
technique than we had previously anticipated. 


Most of the biosystem simulations we have 
undertaken have been written in FORTRAN. In the 
interest of freeing the modeler from some of the 
coding tedium and numerical analysis aspects of 
working at this level, there is a role for 
high-level simulation languages. The first major 
simulation language implemented on MMCS is DAREP 
(developed at the University of Arizona). DAREP 
is a language for describing systems of first 
order differential equations. The package 
includes hardcopy and CRT graphic capabilities. 


ARRAY TYPE VARIABLE TOPOLOGY MULTICOMPUTER SYSTEMS (*) 
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Summary 


A system architectural concept called Variable 
Topology Multicomputer (VTM) is proposed to 
implement large networks of low cost computers linked 
with serial communication paths which can be 
reconfigured according to the needs of each computation 
[1]. VTM consists of N computer pairs called nodes 
interconnected with duplex lines. Each node contains 
a local computer, a communications computer and an 
inter-computer message handler. The local computer 
executes user programmes whereas the communications 
computer is totally dedicated to message handling 
between the nodes. The inter-computer message 
handler contains the input and output terminations so 
as to enable links to be established with other 
nodes. 


VTM utilizes a synchronous communication scheme 
where message carrying packets are transmitted 
during each fixed transmit time $T_t$, repeated every 
main period time $T_m$, common throughout the system 
so that all nodes send and receive messages at the 
same time. Transmission efficiency is defined as 
$\rho_t = T_t / T_m$. 

In a VTM system topology can be varied at two 
levels: physical and logical. By connecting wires 
between various nodes a desired physical network 
topology can be obtained. Over a given physical 
network, it is possible to establish logical 
connections between nodes with no direct link between 
them, by means of one or more intermediate nodes, 
using a packet switched or circuit switched scheme. 


Organizing the VTM nodes as a two dimensional 
mesh yields an array structure. Such a configuration 
is of interest because of its suitability in many 
important fields of application. A simulation 
model of the VTM system has been developed for 
performance evaluation [2]. Extensive studies have been 
carried out on an 8x8 VTM array structure. Boundary 
nodes have been connected so as to obtain a closed 
toroid. The routing matrix is computed by using a 
modified Floyd's algorithm for even load distribution 
[3]. The characteristics of four typical topologies 
that have been tried are listed in Table 1. 
The hexagonal and cubic topology are also included 
for comparison. 
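
As a rough illustration of how a routing matrix for the closed 8x8 
toroid might be obtained, the sketch below applies the standard Floyd 
(Floyd-Warshall) all-pairs shortest path algorithm to a plain toroidal 
mesh. It is an assumption-laden example: the modification for even load 
distribution cited in [3] is not reproduced, the diagonal connections 
of the other topologies are omitted, and the names `build_torus`, 
`floyd_warshall` and `route` are introduced here for illustration only. 

```python
from itertools import product

N = 8  # mesh is N x N, wrapped into a torus


def node(r: int, c: int) -> int:
    """Map (row, column) to a node number, wrapping at the boundaries."""
    return (r % N) * N + (c % N)


def build_torus():
    """Adjacency sets for an N x N mesh with wrap-around (toroidal) links."""
    adj = {v: set() for v in range(N * N)}
    for r, c in product(range(N), repeat=2):
        for dr, dc in ((0, 1), (1, 0), (0, -1), (-1, 0)):
            adj[node(r, c)].add(node(r + dr, c + dc))
    return adj


def floyd_warshall(adj):
    """Return (dist, nxt): hop counts and a next-hop routing matrix."""
    n = len(adj)
    inf = float("inf")
    dist = [[inf] * n for _ in range(n)]
    nxt = [[None] * n for _ in range(n)]
    for u in adj:
        dist[u][u] = 0
        nxt[u][u] = u
        for v in adj[u]:
            dist[u][v] = 1
            nxt[u][v] = v
    for k in range(n):
        for i in range(n):
            dik = dist[i][k]
            for j in range(n):
                if dik + dist[k][j] < dist[i][j]:
                    dist[i][j] = dik + dist[k][j]
                    nxt[i][j] = nxt[i][k]   # first hop toward j goes via k
    return dist, nxt


def route(nxt, src, dst):
    """Follow the routing matrix hop by hop from src to dst."""
    path = [src]
    while path[-1] != dst:
        path.append(nxt[path[-1]][dst])
    return path


dist, nxt = floyd_warshall(build_torus())
pairs = [(i, j) for i in range(N * N) for j in range(N * N) if i != j]
avg_path_length = sum(dist[i][j] for i, j in pairs) / len(pairs)
diameter = max(dist[i][j] for i, j in pairs)
```

The resulting `avg_path_length` is the kind of topology characteristic 
that enters the delay estimates; adding links to `build_torus` would 
let other topologies be compared in the same way. 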


(*) This work is supported in part by a US Army, 
European Research Office, research grant 
(No. DAJA 37-36-0401). 

Performance measures of message delay time, 
total system throughput, and buffer lengths are 
simulated under various topology conditions to 
study the influence of, say, additional lines on 
message delay times. The transmission efficiency 
$\rho_t$ is a measure of channel capacity in the 
system. Its increase provides more slots for message 
transmission. This, however, reduces the period 
during which the local processing takes place and 
hence the requests for transmission. For large 
$\rho_t$ values the average delay time nears the 
average path length times $T_m$. For smaller $\rho_t$ 
delays due to queueing start to accumulate. For 
very small $\rho_t$ values congestion starts 
building up. Throughput depends very little on 
topology for large $\rho_t$ values and goes through a 
maximum as $\rho_t$ is decreased. For small values of 
$\rho_t$ the effect of topology is clearly seen. 
Determining the $\rho_t$ value that yields this 
maximum is called "tuning", where the message 
generation rate is best matched with the message 
transmission capacity. The simulation results have 
indicated that the toroidal organization of the 8x8 
mesh with alternate diagonal connections has 
interesting properties that make it a powerful 
candidate for a general purpose multicomputer 
structure. 
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VIRTUAL INSTRUCTION SETS IN AN MIMD MICROCOMPUTER NETWORK 


Melvin M. Cutler 
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Hughes Aircraft Company 

Summary 


While technology advances have greatly re- 
duced the cost of simple computing devices, it is 
not clear that a network of such devices operating 
in parallel provides a cost-effective solution for 
complex tasks. An ongoing research activity to 
define and evaluate microcomputer architectures 
for effective network implementation has charac- 
terized the generic features of state-of-the-art 
microcomputers, identified those features which 
impair network implementation, and proposed 
improvements [1]. A result of this effort, the 
implementation of virtual instruction sets within 
physical clusters of microprogrammed microcom- 
puters, is summarized here. 


The contemporary computer-on-a-chip is too 
limited and slow for effective networking. Thus, 
an assumption of the research is that a high-speed 
MIMD (Multiple-Instruction, Multiple-Data) network 
is implemented by microprogrammed microcomputers 
using bit-slice CPUs. Two generic 
features of such microcomputers serve as the 
impetus for our design: narrow (typically 16 
bits) instruction words and a CPU minor cycle 
which is two or three times as long as the micro- 
program memory cycle itself. A 16-bit format 
places a premium on operation code field width; 
thus, microcomputer instruction sets are small and 
general-purpose. A fast microprogram memory 
means that it is under-utilized by a single CPU. 
The solution we propose is to share a microprogram 
memory among a number of CPUs. This particular 
approach offers three advantages: 


• Execution speed is not affected 

• Arbitration logic is not needed 

• Hardware savings are converted to 
  software and reliability savings 


The first two advantages are achieved by a 
"barrel switch" which allocates one microprogram 
memory access to each CPU during each of the CPU's 
minor cycles. Thus, if the CPUs operate with 
their minor cycles "out of phase" from one another 
by one microprogram memory cycle, there is no 
change in execution speed. Regularity of 
microprogram accesses ensures that no conflicts occur 
between CPUs, and that no arbitration hardware is 
required. The added cost is the access switching 
mechanism and the faster basic clock, which now 
runs at the rate of microprogram memory cycles 
rather than the rate of CPU cycles. The third 
advantage is a result of using the net hardware 
savings to expand the number and capability of 
(macro-level) instructions implemented in the 
shared microprogram memory. Thus, each cluster 
of microprocessors will have access to a large 
and powerful "real" instruction set. This set 
might be indexed using an 8-bit "real" operation 
code, while the "virtual" operation code would be 
the microcomputer operation code, which might be 
6 bits wide. 
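
The fixed rotation performed by the barrel switch can be pictured 
with a few lines of Python. This is a sketch under an assumption made 
here for illustration, namely that the number of CPUs in a cluster 
equals the number of microprogram memory cycles in one CPU minor 
cycle; the function names are hypothetical. 

```python
def barrel_schedule(num_cpus: int, memory_cycles: int):
    """Yield (memory_cycle, cpu) pairs for a fixed round-robin rotation."""
    for t in range(memory_cycles):
        yield t, t % num_cpus


def accesses_per_minor_cycle(num_cpus: int, minor_cycle_len: int):
    """Count the accesses each CPU receives within one CPU minor cycle."""
    counts = [0] * num_cpus
    for _, cpu in barrel_schedule(num_cpus, minor_cycle_len):
        counts[cpu] += 1
    return counts


# Assumed example: 3 CPUs, minor cycle equal to 3 microprogram memory cycles.
schedule = list(barrel_schedule(3, 6))
counts = accesses_per_minor_cycle(3, 3)
```

Because the slot owner is a pure function of the cycle number, two 
CPUs can never claim the same memory cycle, which is why no 
arbitration logic is needed in this arrangement. 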


For any task, the applications programmer 
selects a 64-instruction subset of the 
256-instruction set, either on an individual 
instruction basis, instruction group basis, or functional 
instruction set basis. During execution, when this 
task is assigned to a microcomputer, the executive 
constructs the mapping from virtual instruction 
code to real instruction code for the particular 
microcomputer; further information on executive 
implementation and protection can be found in [1]. 
Each shared microprogram memory implements this 
mapping via a table addressed by a field which 
consists of a CPU ID code followed by the virtual 
opcode. The content of this table is the 8-bit 
real operation code; the table look-up is done 
once per instruction. The microprogram memory 
contains one address register for each CPU in its 
cluster; while this system is less modular, its 
addressing is completely in the "real" space 
except for instruction sequencing. 
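
The table look-up described above can be sketched as follows. The 
6-bit and 8-bit widths come from the text; everything else (the 
`OpcodeMap` class, its method names, and the example opcode values) is 
a hypothetical illustration rather than the actual executive or 
hardware design. 

```python
from typing import List

VIRTUAL_BITS = 6     # virtual opcode width suggested in the text
REAL_OPCODES = 256   # size of the 8-bit real opcode space


class OpcodeMap:
    """Mapping table addressed by (CPU ID, virtual opcode)."""

    def __init__(self, num_cpus: int) -> None:
        # One 64-entry region per CPU, each entry an 8-bit real opcode.
        self.table: List[int] = [0] * (num_cpus << VIRTUAL_BITS)

    def _index(self, cpu_id: int, virtual_opcode: int) -> int:
        # Address field: CPU ID code followed by the virtual opcode.
        return (cpu_id << VIRTUAL_BITS) | virtual_opcode

    def load_task(self, cpu_id: int, subset: List[int]) -> None:
        """Executive step: install the subset chosen for a task on one CPU."""
        if len(subset) > (1 << VIRTUAL_BITS):
            raise ValueError("subset exceeds the virtual opcode space")
        for virtual_opcode, real_opcode in enumerate(subset):
            if not 0 <= real_opcode < REAL_OPCODES:
                raise ValueError("real opcode out of range")
            self.table[self._index(cpu_id, virtual_opcode)] = real_opcode

    def lookup(self, cpu_id: int, virtual_opcode: int) -> int:
        """Done once per instruction: virtual opcode to real opcode."""
        return self.table[self._index(cpu_id, virtual_opcode)]


opcode_map = OpcodeMap(num_cpus=3)
opcode_map.load_task(0, [200, 17, 5])   # illustrative subset for CPU 0
opcode_map.load_task(1, [5, 200])       # a different subset for CPU 1
```

The same virtual opcode can therefore select different real 
instructions on different CPUs of a cluster, depending on the task 
each CPU is running. 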


It is hoped that the above brief description 
of the implementation of virtual instruction sets 
is sufficient to convince the reader that this 
particular approach is appropriate to low-cost 
microcomputer networks. A popular approach, that 
of using writable control store, is far more costly 
because it requires the addition of low-density 
read-write microprogram memory (for each CPU) as 
well as data paths and control for reading into 
them. An alternative approach, implemented in 
the Burroughs B1700/B1800 [2], is an intriguing 
and low-cost implementation of virtual instruction 
sets via interpretation. However, for a 
microcomputer network with limited memory, the B1700's 
use of distinct interpreters for each (perhaps 
only slightly) different task would be wasteful of 
program memory space and would require extensive 
sharing of common program memory (for interpreters). 
In summary, the proposed design is uniquely suited 
to providing a network of microcomputers with 
powerful instruction set capability and with 
flexibility for degraded mode operation at virtually 
(sic) no additional cost. 

Summary 


In distributed network processing (1), 
the control and the functions of a distributed 
application are performed by many 
geographically dispersed sites. To define 
and implement such an application, one 
needs a Logical Network Machine and its 
operating language, which play the same 
role in the network as the basic software 
of a general-purpose computer. This 
machine, named SIGOR, would supply users 
with the set of tools needed to define and 
implement distributed applications in a 
heterogeneous environment. These tools 
are represented by a transportable, 
interpreted language (2) which is able to 
run on all the machines of the network. 
This language defines the set of objects 
and basic functions linked to the design 
of distributed applications (3): transport 
of algorithms (remote process 
initialization, control of the algorithm's 
transport, control of the distributed 
execution), expression of parallelism (by using 
a variable of mode event and the following 
instructions: wait, post, multiple wait 
of n events among p, check), and 
communication between processes (implicit 
communication of information, explicit transfer 
of information). The Logical Network 
Machine SIGOR is realized on a 
multiprogramming support which conforms to the 
basic principles of a teleprocessing 
system (4). 


The operating language of SIGOR is a 
procedural language (3). The procedure is 
the basic unit used for transport. 
Except in the case of explicit transfer of 
information and explicit synchronization, 
all functions such as communication and 
expression of parallelism between the 
processes that interpret users' procedure 
algorithms are performed implicitly; a tree 
of hierarchical processes, inter-process 
protocols, finite-state automata, and queues 
to stack requests are defined in order to 
perform these functions. 

It is hoped that this summary will 
serve as a gateway to increased 
appreciation of the Logical Network Machine SIGOR 
and of its probable descendant: a high-level 
network command language allowing 
users to define distributed algorithms in 
a network environment. 


References 


(1) Andy Van Dam, Transcripts of the two 
Distributed Processing Workshops, (Aug. 
1976 and Aug. 1977), Brown University, 
Providence, R.I., 02912. 

(2) M.N. Farza, G. Sergeant, Machine 
Interpretative pour la mise en oeuvre 
d'un langage de commande sur le reseau 
CYCLADES, These de Doctorat de 
3e cycle, Universite de Toulouse, 
(1974). 

(3) Ng. X. Dang, Systeme et Langage 
Portable pour le traitement des 
applications reparties, These de Doctorat de 
3e cycle, USMG, INPG, Grenoble, 
(1977). 

(4) Ng. X. Dang, V. Quint, J. Seguin, G. 
Sergeant, Presentation et Definition 
de SYNCOP, un sous-systeme de 
commutation de processus pour la 
tele-informatique et les reseaux d'ordinateurs, 
ENSIMAG, Rapport de Recherche 
no 64, (1977), 54 pp. 

(5) M. Elie, H. Zimmermann, Transport 
Protocol, Standard end-to-end protocol 
for heterogeneous computer networks, 
IFIP WG6.1, INWG 61, (May 
1975), 39 pp. 

(6) N. Wirth, Modula: a language for 
modular multiprogramming, Software - 
Practice and Experience, (1977). 

(7) Brinch Hansen, Concurrent Pascal 
Report, California Institute of 
Technology, (June 1975). 

(8) Hoare, C.A.R., Monitors: an operating 
system structuring concept, 
Communications ACM 17, 10, 549-557, (Oct. 
1974). 

ON THE CONSTRUCTION OF MICROPROCESSOR-ORIENTED OPERATING SYSTEMS 


Martin Freeman 
Walter W. Jacobs 
Department of Mathematics, Statistics and Computer Science 
The American University 
Washington, D.C. 20016 

and 


Leon S. Levy 
Department of Computer and Information Sciences 
University of Pennsylvania 
Philadelphia, Pennsylvania 19174 


Summary 


Microprocessors and semiconductor memories 
are becoming faster and cheaper. As this situation 
progresses, the constraint imposed on the number 
of pins available on these components will force 
us to consider more carefully the functionality of 
such components and their interconnection. 


In this regard, two approaches immediately 
suggest themselves as design philosophies for 
constructing microprocessor systems: (1) provide a 
general interconnection network among 
microprocessors where data paths, control paths and 
communication protocols are already specified, and try to 
map a (software) solution onto such a system; or 
(2) start from the general system functional 
specifications (e.g. system requirements) and refine 
them into a logical design which provides a basis, 
in an implementation phase, for determining the 
(hardware/software) functionality of specific 
microprocessors and a suitable interconnection 
structure. 


In this paper we take the latter approach and 
describe a model (i.e. a conceptual framework) 
which forms the basis for the design and 
implementation of microprocessor systems. 
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A PIPELINED DYNAMO COMPILER* 

ABSTRACT - The design of a pipelined DYNAMO 
compiler which produces parallel code segments 
for a network computer is described. The network 
computer is dedicated to execution of a single job 
at a time. Phases of the compilation process, 
residing on separate computers in the network, 
cooperate to process an input source stream in a 
pipelined style but are constrained not to access 
global tables or intermediate files. The object 
code is partitioned automatically into clusters by 
the compiler and the clusters are allocated to 
constituent computers for run time execution. 
Problems raised by the constraints are discussed 
and design alternatives to these problems are 
examined. 


Introduction 

A pipelined DYNAMO compiler which produces 
parallel code segments has been designed for the 
TECHNEC, a network computer at Illinois Institute 
of Technology. The TECHNEC will be a ring network 
of twelve LSI-11s. It is called a network computer 
rather than a computer network because the whole 
network will be dedicated to the execution of a 
single job at a time. 


The design aims to make full use of the 
parallelism provided by the network computer. At 
compile time, the compiler itself is organized in 
the form of a pipeline. Stages of the pipeline 
execute in parallel and cooperate by passing 
statements in a conveyor belt style, but the 
communication between stages of the pipeline is 
asynchronous. The generated object code is 
partitioned automatically by the compiler into 
clusters which are to be executed in parallel at 
run time on the network computer. 

This paper is concerned with the problems we 
have encountered and the alternative solutions to 
the problems. The deciding factors in solution 
selection are the efficiency of the solution and 
the degree of parallelism exploited. 


* This work was supported by National 
Science Foundation under grant MCS 
76-91316. 


Goals of the DYNAMO Project: 

a. Compilation on a Network of Microcomputers 

This project aims to investigate the problems 
inherent in implementing a high-level language 
compiler which is distributed on a network of 
microcomputers. The compilation process is treated 
as a single task and partitioned into cooperating 
subprocesses on the network. Each microcomputer is 
relatively slow compared with larger computers and 
the primary memory on each microcomputer is 
restricted in size. But a network of microcomputers 
as a whole serves as a powerful computing device by 
exploiting parallelism on the network. 


Microcomputers have been predominantly used to 
control real-time processes, and languages 
available on microcomputers are usually assembly 
languages. This attempt to implement a compiler 
distributed on a network represents an exploration 
of new applications of microcomputer networks. 

b. Pipelined Compilers 

The compiler will be in the form of a pipeline, 
each stage of which carries out an individual phase 
of compilation. 


Each computer on a network has a local primary 
memory but does not share any common global memory 
with another computer. A computer thus can only 
communicate with other computers in the network by 
means of data messages. The TECHNEC on which the 
compiler is to be implemented is in the form of a 
unidirectional ring in which any computer may 
communicate with another by circulating a message 
around the ring. Each phase of the compiler 
receives a statement in the form of a message, 
converts it to some internal form and passes the 
converted statement to the next stage as a message. 
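
The message-passing discipline of the pipeline can be imitated on a 
single machine with threads and queues. The sketch below is only an 
analogy under stated assumptions: threads stand in for the separate 
computers, `queue.Queue` stands in for the ring links, and the three 
phases are hypothetical placeholders rather than the phases of the 
actual compiler. Each stage keeps no shared tables and sees only the 
messages it receives. 

```python
import queue
import threading

DONE = None  # sentinel message marking the end of the source stream


def stage(transform, inbox, outbox):
    """Run one phase: receive a message, convert it, pass it on."""
    while True:
        msg = inbox.get()
        if msg is DONE:
            outbox.put(DONE)   # forward the end marker and stop
            return
        outbox.put(transform(msg))


def run_pipeline(statements, transforms):
    """Connect the phases by queues and stream the statements through."""
    queues = [queue.Queue() for _ in range(len(transforms) + 1)]
    workers = [
        threading.Thread(target=stage, args=(t, queues[i], queues[i + 1]))
        for i, t in enumerate(transforms)
    ]
    for w in workers:
        w.start()
    for s in statements:
        queues[0].put(s)
    queues[0].put(DONE)
    results = []
    while True:
        msg = queues[-1].get()
        if msg is DONE:
            break
        results.append(msg)
    for w in workers:
        w.join()
    return results


# Hypothetical phases: strip blanks, split off the type letter, tag the result.
phases = [
    lambda text: text.strip(),
    lambda text: (text[0], text[1:].strip()),
    lambda pair: {"type": pair[0], "body": pair[1]},
]
output = run_pipeline(["L ABC.K=ABC.J+DT*R.JK ", " R R.KL=ABC.K/10"], phases)
```

Since every stage is a single consumer of a first-in, first-out 
queue, statements leave the pipeline in the order they entered, even 
though the stages run asynchronously. 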


c. Partitioning a Distributed Program 

Parallelism is to be exploited by executing the 
compiled object code in parallel on the computers 
of the network. The generated code is partitioned 
automatically by the compiler into code segments 
called clusters. This partitioning involves 
tradeoffs between speedup due to parallel execution 
of clusters and the amount of message passing 
necessitated by communication between dependent 
clusters. 


d. Synchronization Between Clusters 

The compiled object code of the DYNAMO compiler 
will be in the form of program clusters which 
communicate by data messages. A communication 
mechanism has to be provided between the clusters. 
The style of synchronization for the clusters is 
an important problem. 


Choice of Language 

Simulation of parallel processes is a basic 
concern of the Network Research Group at Illinois 
Institute of Technology because we view simulation 
as a fundamental part of the future development of 
networks. While simulation is often carried out on 
single processors, there are obvious conceptual 
advantages in simulating parallel processes on a 
network of parallel processors. Clearly this kind 
of simulation is a most natural and appropriate 
task for a network computer. 


The decision to focus on continuous rather than 
discrete simulation was motivated by the concern 
of the Network Research Group with control 
processes. This group aims at investigating a style 
of control developed in [3] in which complex tasks 
requiring accurate coordination of many variables 
are performed by distributed controllers, each 
handling a stage of rough computation. The TECHNEC 
system [4] provides a hardware/software 
environment for experimenting with this style of 
control. A control process is to be programmed as a 
collection of control tasks, each responsible for 
controlling a subset of the variables. The whole 
TECHNEC is to be dedicated to the execution of a 
single control process at a time. These control 
processes will be studied with the help of 
simulation models. This design leads to the 
implementation of a continuous simulation language. 
An appropriate language tool must lend itself 
easily to problem decomposition. 


DYNAMO [2,5] is a well-known continuous 
simulation language. While DYNAMO presents serious 
complications in the areas of sequencing and 
partitioning, it is easy to parse and its only data 
structures are simple variables and 1-dimension 
arrays. Thus implementation of DYNAMO seemed to be 
a feasible step in the development of network 
software. 


A continuous simulation model is often 
represented by a set of differential equations. 
DYNAMO [5] models a system with a set of variables 
called LEVELs and their rates of change called 
RATEs. The differential equations are solved by 
determining the value of each LEVEL at regular time 
points and the corresponding RATE in an interval 
between adjacent time points. The value of a LEVEL 
at a simulation time point is expressed as an 
integration of the corresponding rate over a 
regular time interval. A very simple integration 
scheme (the Euler or rectangular method) is used. 
The scheme is very efficient when no need for great 
accuracy exists. There are also AUXILIARY variables 
to help specify the relationship between variables, 
especially in nonconservative systems. 
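
The Euler (rectangular) scheme amounts to adding the rate, held 
constant over one interval, times the interval length to the previous 
level value. A minimal sketch follows; the function name `simulate`, 
the decay model, and the numerical values are assumptions chosen for 
illustration and are not taken from the DYNAMO implementation. 

```python
def simulate(level0: float, rate_fn, dt: float, steps: int):
    """Euler (rectangular) integration of one LEVEL driven by one RATE."""
    levels = [level0]
    level = level0
    for _ in range(steps):
        rate = rate_fn(level)       # RATE held constant over the interval
        level = level + dt * rate   # LEVEL at the next time point
        levels.append(level)
    return levels


# Assumed example model: exponential decay, RATE = -LEVEL / 4.
history = simulate(level0=100.0, rate_fn=lambda lv: -lv / 4.0, dt=1.0, steps=3)
```

With these assumed values the level falls from 100 to 75, then to 
56.25 and 42.1875; a smaller interval would track the exact 
exponential more closely at the cost of more steps. 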


DYNAMO differs from procedural languages in that statements of a
DYNAMO program are not sequential, i.e., they can be written in any
order without affecting the outcome of the program. Neither goto nor
conditional statements are provided. In a traditional implementation
for single processor systems, there is an implicit order of executing
LEVEL variables as a group first, followed by AUXILIARYs and finally
RATES in one cycle of simulation. There is still considerable freedom
available in varying the order of execution of LEVELS since they are
independent of one another. Similarly all RATES are independent of one
another. This independence among LEVELS and RATES gives rise to
opportunities in parallel processing. In a network computer such as
the TECHNEC on which a single DYNAMO program is distributed, much more
parallelism can be exploited. An AUXILIARY equation can be executed in
parallel with a LEVEL or a RATE equation allocated to a different
processor as long as they are independent.

The state of a model is computed at regular time points. The length
of the constant interval is designated by the
symbol DT. The size of DT is chosen by
the user. DYNAMO adopts the convention 
of attaching one of the symbols J, K, JK, 
or KL as subscripts to a variable to 
indicate the timing. The value of level 
ABC at the instant at which calculations 
are being made is referred to as ABC.K. 
Its value at the previous instant is 
ABC.J. The interval just passed is 
called the JK interval; the interval com- 
ing up is the KL interval. Since RATEs 
hold over an interval, their subscripts 
are either JK or KL while other variables 
have J or K as subscripts. It is not
necessary to attach a subscript to con- 
stants. 


Most DYNAMO statements are assignment statements whose right hand
sides (RHS) are arithmetic expressions. We will use the term
'equation' interchangeably with 'statement' since the assignment
statement defines the value of the variable on the left hand side
(LHS) at a particular time point. The type of the variable on the LHS
is indicated by the first character in the statement. Thus in

L ABC.K=ABC.J+DT*R.JK 

ABC is defined as a LEVEL. There are seven equation types: level (L),
auxiliary (A), rate (R), supplementary (S), initial value (N), given
constant (C) and table (T). Each variable is defined exactly once with
at most one corresponding initial value (N) equation.


To summarize, DYNAMO has been adopted as a research vehicle for
several reasons. First of all, simulation of parallel processes is a
central problem in network development and a most important
application for network computers. DYNAMO raises the central problem
of partitioning tasks in an urgent and immediate fashion. Second, the
Network Research Group needs a continuous simulation language to model
control processes. Third, the simplicity of the syntax and data
structures of DYNAMO makes it a good starting point for compiler
development on networks.


TECHNEC System Overview 


We shall present a brief introduction to the hardware configuration
and software facilities available on the TECHNEC. The emphasis is on
the interface between the available software facilities and the
DYNAMO Compiler.


Hardware Configuration 


The TECHNEC [3] is a ring network of five nodes initially (Figure 1)
with 12 nodes planned in the second year of the project. Each node
consists of a COSMAC (called the Ring Interface Unit - RIU) and an
LSI-11 (called the Micro Processor Unit - MPU). COSMACs are linked
together by I/O ports to form a ring. Each MPU is attached to a
corresponding RIU. All user tasks reside in MPUs. The RIUs are
responsible for message communication among the nodes.


Each MPU is a 16-bit LSI-11 with at least 12K words of RAM and
floating point hardware. One of the MPUs has an RX-11 dual floppy disk
and serial I/O interface. This node will be designated as the system
node. A system console is attached to this node. The network will be
connected to other computers on campus via modems.




Figure 1. Structure of TECHNEC


The RIU is an 8-bit microprocessor with 1K bytes of RAM and three
sets of I/O ports. One set of ports implements the message
communication path between adjacent RIUs. The message communication
between RIUs is byte parallel and unidirectional. The other two sets
of ports are used for communication with its corresponding MPU. One
set implements a control/status and data buffer register interface and
the other set serves as a DMA (Direct Memory Access) interface between
the RIU and the memory of the MPU. An RIU may interrupt its MPU (but
not the other way around) and it can access the 12K RAM of its MPU in
the DMA mode.


Software Facilities 


The operating system includes a mul- 
titasking executive called SEXTECH which 
allows multiple tasks to reside in one 
MPU and schedules user tasks in a simple
round robin fashion. 


The TECHNEC supports two modes of message communication between
tasks. One is the broadcasting mode in which one task passes a message
around the ring via a 'channel' and all tasks which are opened to
receive messages at this specific channel may receive the message. The
channels are virtual because no physical links are established between
the tasks. A channel is simply an identifier tagged to each message.
This is a one-to-many communication mode. The identity and the
location of the receivers are not known to the sender. The other mode
is point-to-point transmission in which a task transmits a message to
exactly one receiver via a channel and the receiver is identified by a
'subchannel.' The location of the receiver need not be known to the
sender.
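
The two communication modes can be illustrated with a toy model in Python. It is a sketch of the addressing idea only (a channel is a tag, a subchannel picks one receiver) and says nothing about how the RIUs actually route bytes; the class names `Ring` and `Task` and all method names are hypothetical.

```python
from collections import defaultdict


class Task:
    """A user task with a simple inbox (hypothetical stand-in)."""

    def __init__(self, name):
        self.name = name
        self.inbox = []


class Ring:
    """Toy model of channel-tagged messaging; not the TECHNEC executive."""

    def __init__(self):
        self.listeners = defaultdict(list)  # channel -> tasks opened on it
        self.subchannels = {}               # (channel, subchannel) -> task

    def open_channel(self, task, channel):
        self.listeners[channel].append(task)

    def open_subchannel(self, task, channel, subchannel):
        self.subchannels[(channel, subchannel)] = task

    def broadcast(self, channel, message):
        # One-to-many: the sender names only the channel, never a receiver.
        for task in self.listeners[channel]:
            task.inbox.append(message)

    def send(self, channel, subchannel, message):
        # One-to-one: the subchannel identifies the receiver, not its node.
        self.subchannels[(channel, subchannel)].inbox.append(message)


ring = Ring()
a, b, c = Task("a"), Task("b"), Task("c")
ring.open_channel(a, 7)
ring.open_channel(b, 7)
ring.open_subchannel(c, 7, 1)
ring.broadcast(7, "start cycle")
ring.send(7, 1, "ABC.K=3.5")
```

In both calls the sender supplies identifiers rather than node addresses, which matches the statement that the location of the receiver need not be known.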

A collection of facilities such as the console management routine,
file management, and debugging facilities are available on the system
node. The console management routine allows the system console to
interact with user tasks via messages. The file management routine
provides storage and retrieval of files resident on one of the floppy
disks. The debugging routine provides functions such as suspension of
a task, resumption of a task, modifying contents of a location,
display of status, and breakpoints. A loading routine exists to load
programs from the floppy disk or other external computers to the
TECHNEC.


Structure of the Compiler 


The compiler is structured in the 
form of a pipeline with the various 
phases of the compilation process distri- 
buted over the ring network. Each phase 
resides as a module on a separate com- 
puter of TECHNEC. A module receives one 
statement in the form of a message at a 
time from the previous module, performs 
one compilation phase on the statement,
and passes the statement to the next 
module. Statements of the source pro- 
gram, originating from the system node, 
thus pass through the phases in order, 
with no feedback required. So one may 
consider the compiler to be pipelined in 
the same sense as pipelined arithmetic 
units. The code generated will return to 
a file at the system node. The system 
node behaves both as a source and a sink
for the pipeline. 
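
The pipeline organization can be sketched with Python generators, where each phase sees one statement message at a time and passes an enriched message on. This is a single-process illustration of the data flow only: generators run lazily in one thread, not in parallel on separate nodes, and the phase names below are invented for the sketch.

```python
def source(lines):
    """System node acting as the source: one statement per message."""
    for text in lines:
        yield {"text": text}


def number_statements(messages):
    """A phase that embeds a statement number in the message itself."""
    for number, msg in enumerate(messages, start=1):
        yield {**msg, "number": number}


def classify(messages):
    """A phase that records the equation type from the first character."""
    for msg in messages:
        yield {**msg, "type": msg["text"][0]}


def compile_pipeline(lines):
    stream = source(lines)
    for phase in (number_statements, classify):
        stream = phase(stream)  # each phase feeds only the next one
    return list(stream)         # system node acting as the sink


result = compile_pipeline(["L ABC.K=ABC.J+DT*R.JK", "R R.KL=ABC.K/10"])
```

Because each phase adds its findings to the message instead of writing to a shared table, no phase needs to look backward, which mirrors the no-feedback property described above.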


Several severe constraints are imposed on the design of the compiler.
First, no intermediate files exist between the phases. Each phase can
be considered to be processing a statement in the statement stream
through a window. Once a statement is processed and passed to the next
phase, neither the original nor the modified form of the statement
will be available to the phase. Secondly, the individual phases cannot
access global tables. Ideally information derived by each phase should
be embedded in the internal code which is routed to successive phases.
Thirdly, the memory available to each phase is limited.

The first two constraints are not due to theoretical or physical
limitations but are based on performance considerations. A file or
information tables at any node could be made accessible to any process
on TECHNEC via the interprocess communication mechanisms. It is felt
that message communication involves too much overhead in parallel
computation on TECHNEC. No feedback is allowed in the compiler out of
a desire to keep the pipeline as full as possible and to reduce
message communication. The compiler as a whole is a one pass compiler
without the benefit of global tables. Moreover it is distributed on
multiple computers. The compiler is composed of eight modules
organized as in Figure 2:


[Figure 2 diagram: the pipeline stages in order (Scanner, Macro
Expansion, Symbol Table Routine, Parser, Sequencing Module, Code
Generator, Partitioning Module), with an optional listing of source
and expanded macro statements and with error reports routed to the
Error Message Generator.]

Figure 2. Structure of the Pipelined DYNAMO Compiler.


Error reports, originating from one of the first seven modules,
bypass intermediate modules to reach the Error Message Generator which
produces symbolic error messages.


Figure 2 represents the structure of 
the compiler in our current design. Ini- 
tially we placed the Code Generator after 
the Partitioning Module. We realized, 
however, that if the Code Generator pre- 
ceded the Partitioning Module, the latter 
would have a more accurate estimate of
the execution time and storage require- 
ments of statements in forming clusters. 


A statement is passed between
modules in an internal form of a string
of tokens. A token has two fields: type
and value. The type field indicates the
symbol represented by the token (e.g.,
identifier, subscript, statement, etc.).
The interpretation of the value field is 
dependent on the type field. For example 
the value field of a subscript token
indicates the subscript type (J, K, JK or 
KL). The value field of a statement 
token denotes the statement type (level, 
auxiliary, rate, supplementary, constant, 
initialization, table, etc.). The value 
field of an identifier or a real number 
points to the original symbolic represen- 
tation in a character string which fol- 
lows the string of tokens. Organization 
of the string and information in the 
token vary from phase to phase. 


Scanner 


The input to the Scanner is the 
DYNAMO source program. The function per- 
formed by the Scanner is to transform the 
text input into an internal form of 
tokens. 


Scanning a DYNAMO statement is a simple matter. A routine is first
called to scan the statement identifier (L, A, R, SPEC, etc.). All
statements except PRINT, PLOT, NOTE, RUN and title statements are
scanned by the same routine. There are only four kinds of symbols that
need to be dealt with: quantity names, subscripts, numeric constants
and delimiters/operators. A token is created for each symbol and
stored in the output message. When the statement is completely
scanned, the message is sent to the Macro Expansion program.
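
As a rough illustration of the scanning step, the following Python sketch turns one equation into (type, value) tokens for the four symbol kinds named above. The regular expression, the token type names, and the function `scan` are assumptions for this sketch; in particular it stores the symbol text directly in the value field, whereas the compiler described here points into a trailing character string.

```python
import re

TOKEN_RE = re.compile(r"""
    (?P<number>\d+\.\d*|\.\d+|\d+)      # numeric constant
  | (?P<name>[A-Z][A-Z0-9]*)            # quantity name
  | (?P<subscript>\.(?:JK|KL|J|K)\b)    # time subscript
  | (?P<op>[=+\-*/(),])                 # delimiter or operator
""", re.VERBOSE)


def scan(statement):
    """Return a list of (type, value) tokens for one DYNAMO equation."""
    kind, _, body = statement.partition(" ")
    tokens = [("statement", kind)]      # statement identifier: L, A, R, ...
    body = body.replace(" ", "")
    pos = 0
    while pos < len(body):
        m = TOKEN_RE.match(body, pos)
        if m is None:
            raise ValueError(f"unexpected character {body[pos]!r}")
        text = m.group()
        if m.lastgroup == "subscript":
            text = text.lstrip(".")     # keep J, K, JK or KL only
        tokens.append((m.lastgroup, text))
        pos = m.end()
    return tokens


tokens = scan("L ABC.K=ABC.J+DT*R.JK")
```

The resulting list would play the role of the output message handed to the next module.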
Macro Expansion 

The macro expansion module expands macro calls into one or more
DYNAMO statements, produces the source listing (optionally listing
expanded statements), and assigns to each statement a unique number.

The language requires that a macro definition appear before a call to
it is made. This is important for a one pass compiler. The tokens for
statements within a macro definition are stored in a table as the
statements are received from the Scanner. Special tokens for
occurrences of local variables, formal parameters, and macro names
replace the normal identifier tokens to speed up macro expansion. The
macro and a pointer to the macro definition are stored in a second
table.


At expansion time, the macro call is replaced by a compiler-generated
identifier. Each statement in the macro definition is processed by
replacing local variables and occurrences of the macro name by
compiler-generated identifiers and replacing formal parameters by the
actual parameters. As each statement in the macro body is processed,
tokens are transferred to a message to be sent to the symbol table
program. Macro calls are allowed in macro definitions and in actual
parameters. These nested macro calls get expanded in the same way.


The source listing is produced by this module because it is felt that
listing of expanded statements should be a user option and it is
desirable to perform the source listing as early as possible to reduce
message passing overhead.


The number assigned to a statement is printed on the listing and
stored in the internal form statement for the purpose of associating
an error message (both compile time and run time) to a particular
statement.
Symbol Table Manipulation Routine 

The Symbol Table Manipulation Routine is the third module in the
pipeline. The function performed by this module is to replace the
value field of quantity name identifier tokens by an index into the
symbol table. It also checks subscripts in a given statement type,
checks for multiple definitions of a quantity name, checks for
undefined quantity names, and searches for conflicting use of
subscripts.


Converting value fields of an identifier token to a symbol table
index is straightforward. When a message is received from the Macro
Expansion Module, each identifier is looked up in the symbol table. If
the identifier is not found in the table, a new entry is made in the
symbol table and the value field set to the table index.


More work is required to achieve subscript checking. The
nonsequential nature of DYNAMO gives rise to the problem of verifying
a subscript appearing on the RHS of an equation. It should be
emphasized that the pipeline does not permit a second pass through the
source. One solution is to keep in the symbol table entry for each
quantity name a bit to indicate whether or not the identifier is
defined. If defined, a field indicates the equation type (L, A, R, S,
N, C, CP, T or TP) in which the quantity name is defined; in this case
the subscript is immediately verified. If the quantity name is not yet
defined, the equation type field is a pointer to a linked list. Each
node in the linked list contains a field for the statement number, a
field indicating the equation type and a field indicating the
subscript used. When the definition of the quantity name is
encountered, the subscripts for the previous references to the
quantity name are verified for correctness. The Symbol Table Routine
notices inconsistencies in use of subscripts.
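
The deferred checking scheme can be sketched as follows. This is an illustration under simplifying assumptions: the legality rule is reduced to "rates take JK or KL, everything else takes J or K", the pending list omits the referencing equation type that the linked list nodes above also carry, and the class and method names are hypothetical.

```python
RATE_SUBSCRIPTS = {"JK", "KL"}   # rates hold over an interval
POINT_SUBSCRIPTS = {"J", "K"}    # simplified rule for all other types


class SymbolTable:
    def __init__(self):
        self.defined = {}    # name -> equation type, once defined
        self.pending = {}    # name -> [(statement number, subscript)]
        self.errors = []     # (statement number, name, problem)

    def _check(self, name, eq_type, stmt_no, subscript):
        allowed = RATE_SUBSCRIPTS if eq_type == "R" else POINT_SUBSCRIPTS
        if subscript not in allowed:
            self.errors.append((stmt_no, name, subscript))

    def reference(self, name, stmt_no, subscript):
        """Record a use of `name` on the RHS of statement `stmt_no`."""
        if name in self.defined:
            self._check(name, self.defined[name], stmt_no, subscript)
        else:
            # Definition not seen yet: queue the reference for later.
            self.pending.setdefault(name, []).append((stmt_no, subscript))

    def define(self, name, eq_type, stmt_no):
        """Record the defining equation and verify queued references."""
        if name in self.defined:
            self.errors.append((stmt_no, name, "multiply defined"))
            return
        self.defined[name] = eq_type
        for ref_no, subscript in self.pending.pop(name, []):
            self._check(name, eq_type, ref_no, subscript)

    def undefined(self):
        """Names referenced but never defined, known at end of input."""
        return sorted(self.pending)


table = SymbolTable()
table.reference("FLOW", 1, "K")   # used before its definition is seen
table.define("ABC", "L", 1)
table.define("FLOW", "R", 2)      # now the queued reference is checked
table.reference("GHOST", 3, "K")  # never defined
```

Each reference is verified exactly once, either immediately or when the definition arrives, so no second pass over the source is needed.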
Parser 

The Parser is the fourth stage in the pipeline. Separate routines
parse assignment statements, print and plot statements, and
specification statements. The main function performed by this module
is to transform expressions from infix to Polish suffix notation.
Polish suffix notation was chosen as internal form because the LSI-11
provides stack operations. For this kind of stack machine, code
generation from Polish suffix form is particularly simple.


A transition matrix is used by the parser to handle arithmetic
expressions. Transition matrix parsing has the advantage of being a
particularly robust parsing method. It also facilitates the production
of good error messages. The main disadvantage of this parsing method
is the space required for the matrix. Fortunately, DYNAMO expressions
are so restricted in form that the matrix for this language is of
reasonable size. Operator precedence parsing is awkward for DYNAMO
because the language allows two operators to appear next to each
other; A*-B and (A+B) (A-C) are both legitimate.
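
To make the infix-to-suffix transformation concrete, here is a small Python sketch. Note that it uses the familiar operator stack (shunting-yard) technique rather than the transition matrix the Parser actually uses, so it shows the target notation, not the method. The `NEG` marker for unary minus is an invention of the sketch, and implicit multiplication such as (A+B) (A-C) is not handled.

```python
OPERATORS = {"+", "-", "*", "/"}
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2, "NEG": 3}


def to_suffix(tokens):
    """Convert a list of infix tokens to Polish suffix order."""
    output, stack = [], []
    previous = None                     # last token seen, None at start
    for tok in tokens:
        if tok == "(":
            stack.append(tok)
        elif tok == ")":
            while stack[-1] != "(":
                output.append(stack.pop())
            stack.pop()                 # discard the "("
        elif tok in OPERATORS:
            unary = tok == "-" and (
                previous is None or previous in OPERATORS or previous == "("
            )
            if unary:
                stack.append("NEG")     # a minus with no left operand
            else:
                while (stack and stack[-1] != "("
                       and PRECEDENCE[stack[-1]] >= PRECEDENCE[tok]):
                    output.append(stack.pop())
                stack.append(tok)
        else:
            output.append(tok)          # operand: name or constant
        previous = tok
    while stack:
        output.append(stack.pop())
    return output


adjacent_ops = to_suffix(["A", "*", "-", "B"])
grouped = to_suffix(["(", "A", "+", "B", ")", "*", "C"])
```

In suffix form every operator follows its operands, so a stack machine can emit a push for each operand and an operation for each operator in a single left-to-right sweep.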


Sequencing Module


The nonsequential characteristic of DYNAMO by no means implies that
DYNAMO statements can be executed in any order. Determination of data
dependency, sequencing of statement execution, and initialization of
the model are essential tasks of a DYNAMO compiler. Each DYNAMO
assignment statement can actually be considered the defining equation
of its LHS variable. Moreover, these statements define a partial
ordering relation between variables in the program in which the LHS
variable is a "successor" of each variable in the RHS. A topological
sorting algorithm can be used to produce a linear sequence consistent
with the partial ordering relations.

As was mentioned in Section 3, there is an implicit order of
execution (or sequencing) of statements during each simulation cycle,
that begins with LEVEL equations followed by AUXILIARY equations, and
finally RATE equations. This sequencing will be referred to as the
"LAR looping sequence." The LEVELS can be computed in any order in the
beginning of a cycle, and also the RATES can be executed in any order
at the end of the cycle [5]. The sequencing of AUXILIARY equations
must be determined by the compiler.


A DYNAMO compiler must provide the initial values for those
auxiliaries and rates that have no explicit (i.e., user-defined)
initial value equations (N-equations) by treating the auxiliary or
rate equation as an N-equation [5, p. 25]. This actually means that
the first simulation cycle is to produce the initial values for all
quantities in the model. This is done by beginning with those
quantities that have a user-defined initial value and executing the
appropriate statements in the model in the correct order to provide
initial values for other quantities. This sequence is referred to as
the "initialization sequence." In general the initialization sequence
may differ from the LAR looping sequence (in fact they are different
in most practical examples). Hence, the Sequencing Module (SM) can be
schematically represented by Figure 3. The input is fed to the
sequencing module a statement at a time. Data dependency information
is extracted from that statement and the statement is passed untouched
to the next stage. When a RUN statement is encountered, SM begins
processing the accumulated information.


[Figure 3 diagram: the Sequencing Module (SM) takes a sequence of
parsed statements as input and outputs the parsed statements, the
constants sequence, the initialization sequence, and the LAR looping
sequence.]

Figure 3. Input and Output of the Sequencing Module


Constant (C) and table (T) equations actually may be evaluated in any
order at the very beginning in the initialization sequence. Moreover,
constants and tables are the only quantities that may be redefined in
case of reruns. It is, therefore, expedient to group constant and
table equations as a separate sequence, referred to as "Constants
Sequence" in Figure 3.


The Sequencing Module can only determine sequencing after examining
the complete program. Again the three constraints mentioned at the
beginning of this section come into play. To avoid a second pass and
to keep the next stage busy while SM is functioning, each statement
written by the programmer will be eventually converted to a subroutine
by the Code Generator. The SM will produce a sequence of subroutine
calls.


The Sequencing Module can be described functionally by the flowchart
in Figure 4. There are two logical parts in SM. The first part,
consisting of modules M1, M2, and M3, builds the data structure and
the second part produces the sequences. The data structure used to
convey the data dependency information in the SM consists of a set of
linked lists. Each linked list, shown in Figure 5, corresponds to a
variable in the program and contains the data dependency in both the
defining equation of the variable and the user-defined initial value
equation, if available. The data dependency information is conveyed in
the form of a COUNT field, indicating the number of predecessors to
the variable, and a successor list for that variable.


[Figure 4 flowchart: receive a statement token through the pipe (M1);
pass the statement to the next module; add the appropriate information
to the data structure that represents data dependency. Then test
whether the program can be initialized: if not, generate the
appropriate error message; if so, produce the "initialization
sequence" calls and then the "LAR looping sequence" calls.]

Figure 4. Logical Flow of the Sequencing Module


The SM then checks whether the model can be initialized and simulated
properly. The conditions for proper execution are:

a. All LEVELS are initialized using user-defined N-equations. This is
checked using the T and NEQ fields.

b. All variables are properly defined. This is indicated by a nonzero
DEQ# field.

[Figure 5 diagram: a HEADER record for each variable, linked to
SUCCESSOR LIST1 and SUCCESSOR LIST2.]

Fields are interpreted as follows:

T is a type field (indicating the type of the variable)
NAME is the symbol table entry for the LHS variable
NEQ is a pointer to the list representing the corresponding
N-equation supplied by the user (0 if none)
DEQ# is the defining equation number of the variable
COUNT1 is the number of predecessors of the variable in DEQ
COUNT2 is the number of predecessors of the variable in NEQ
TOP1 is a pointer to the first member of the SUCCESSOR LIST1
TOP2 is a pointer to the first member of the SUCCESSOR LIST2
NEXT1 is a pointer to the next entry in the SUCCESSOR LIST1
NEXT2 is a pointer to the next entry in the SUCCESSOR LIST2
SUC1 is the symbol table entry for the successor in LIST1
SUC2 is the symbol table entry for the successor in LIST2
N# is the user-supplied N-equation number for the variable

Figure 5. Data structure used in the Sequencing Module

The "LAR looping sequence" calls can be generated directly from the
"initialization sequence" calls if no AUXILIARY variable has a
user-defined N-equation. In this case, the LAR looping sequence is
simply calls to LEVELS in any order, followed by calls to AUXILIARYs
in the same order as in the initialization sequence (same equation
numbers), followed by calls to RATES in any order.
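
For the simple case just described, deriving the looping order amounts to regrouping the initialization order by equation type. The Python sketch below shows this; the function name, the variable names, and the sample model are hypothetical.

```python
def lar_looping_sequence(init_sequence, eq_type):
    """Regroup an initialization order into LEVELs, AUXILIARYs, RATEs.

    init_sequence -- variables in the order the initialization calls run
    eq_type       -- maps each variable to "L", "A" or "R"
    Only valid when no AUXILIARY has a user-defined N-equation.
    """
    levels = [v for v in init_sequence if eq_type[v] == "L"]
    # Auxiliaries keep their relative order from the initialization.
    auxiliaries = [v for v in init_sequence if eq_type[v] == "A"]
    rates = [v for v in init_sequence if eq_type[v] == "R"]
    return levels + auxiliaries + rates


eq_type = {"LEV": "L", "A1": "A", "A2": "A", "RT": "R"}
init_sequence = ["LEV", "A1", "RT", "A2"]
loop_sequence = lar_looping_sequence(init_sequence, eq_type)
```

Levels and rates may be emitted in any order, so keeping their initialization order is merely one convenient choice.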


On the other hand, if N-equations are supplied for some AUXILIARY
equations, the data structure is searched for entries with zero NEQ
fields. For each entry, a SUCCESSOR LIST2 is built as required by the
language as a copy of SUCCESSOR LIST1 with N# equal to DEQ# and COUNT2
equal to the COUNT1 field. This is necessary because the topological
sorting program has to be run twice in this case.

In general, the topological sorting program incorporated in the SM
searches for a zero COUNT field, produces the call for the
corresponding equation number, and decrements by 1 the COUNT field in
the header of each variable appearing in the SUCCESSOR LIST. To
produce the "initialization sequence" calls in case a above (no
AUXILIARY variable has a user-defined N-equation), this procedure is
applied iteratively using COUNT2 and SUCCESSOR LIST2 for those entries
with nonzero NEQ fields and COUNT1 and SUCCESSOR LIST1 for the others.
In case b (N-equations are supplied for some AUXILIARY equations),
COUNT2 and SUCCESSOR LIST2 are used for all variables. If the model is
consistent and an evaluation sequence can be found, the SM produces
calls in the correct order. Otherwise, the number of calls produced
does not check with the number of statements processed and a
"simultaneous equations" error message is generated from SM that
contains those variables for which no calls were produced.
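
The counting procedure above is a standard topological sort and can be sketched in Python as follows. The dictionaries `count` and `successors` stand in for the COUNT fields and SUCCESSOR LISTs; the function name, the input format, and the use of an exception for the error report are assumptions of the sketch.

```python
def sequence(equations):
    """Order equations so every variable follows its predecessors.

    equations -- maps each defined variable to the variables on its RHS.
    RHS names with no entry (e.g. constants) are treated as available.
    """
    count = {var: 0 for var in equations}        # COUNT field
    successors = {var: [] for var in equations}  # SUCCESSOR LIST
    for var, preds in equations.items():
        for pred in set(preds):
            if pred in equations:
                count[var] += 1
                successors[pred].append(var)

    ready = [var for var in equations if count[var] == 0]
    calls = []
    while ready:
        var = ready.pop(0)
        calls.append(var)                # produce the call
        for succ in successors[var]:
            count[succ] -= 1             # one predecessor satisfied
            if count[succ] == 0:
                ready.append(succ)

    if len(calls) != len(equations):
        # Variables left with nonzero counts lie on a dependency cycle.
        stuck = sorted(set(equations) - set(calls))
        raise ValueError("simultaneous equations: " + ", ".join(stuck))
    return calls


order = sequence({"A3": ["A2", "A1"], "A2": ["A1"], "A1": ["LEV"]})
```

Comparing the number of calls with the number of equations is the same consistency test the text describes: a shortfall means some variables depend on each other in a loop.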


Code Generator 


The Code Generator receives from the 
SM parsed statements. It extracts the 
variables from a statement and sends them
as a list to the PM which will need 
predecessor-successor relationship among 
variables. The parsed statement is con- 
verted by the Code Generator to a subrou- 
tine in assembly language and sent to the 
code file. PRINT and PLOT statements are 
handled differently since routine check- 
ing and formatting are necessary. The 
Code Generator finally generates for the 
PM execution time and storage require- 
ments of each statement. 


Partitioning Module (PM) 


One of the main objectives of the DYNAMO project is to exploit
parallelism by executing the compiled object code in parallel on the
computers of the network (goal c). The partitioning module is to
receive from the Code Generator (CG) lists of variables in each
statement that are used to build a data structure representing the
dependency between variables. After encountering the RUN statement,
the module processes the data structure and produces a "processor
assignment list" that specifies the statements to be executed on each
processor in the network, using the statistics provided for execution
time and storage requirements of each statement. This can be
represented schematically as in Figure 6. The PM also is supposed to
insert the required communication primitives between variables in
different partitions. Although the data structure required in PM has
many similarities with that of the Sequencing Module, since both of
them reflect some sort of connectivity relation, a main difference
exists in subscript treatment. In PM the initialization cycle is
completely ignored, because it occurs only once. Supplementary
equations are also ignored at this point because they are only
executed during special print or plot cycles. The main objective is to
produce partitions that will reside on different processors at
run-time in order to achieve the fastest execution of the program.
Hence, it is clear that a relation between the initial value of two
variables V1, V2 is not as important as a recurrence relation between
them.

[Figure 6 diagram: the Partitioning Module (PM) takes variable lists,
statements and statistics as input and outputs a list of communication
calls, a list of processor assignments, and pseudo statements.]

Figure 6. Input and Output of the Partitioning Module.


Consequently, a directed graph, representing two successive
simulation cycles of a DYNAMO source program, is built as follows:

(i) All C, N, S, PRINT, PLOT, SPEC, T statements are ignored. An
appropriate initializing process will be provided to initiate constant
sequence and initialization sequence calculations and send the results
to different partitions before the main iteration operation begins.

(ii) A variable with a J or JK subscript is represented by an entry
node with no predecessors.

(iii) Every variable V is represented by two nodes N1, N2:

N1 representing V.J (or V.JK)
N2 representing V.K (or V.KL)

As mentioned in (ii) N1 is an entry node.

(iv) An arc from node Ni to node Nj represents a precedence relation,
i.e., Ni is a RHS variable of an equation that has Nj in its LHS.
The main algorithm for automatic partitioning is the subject of a
study parallel to this project. The main points investigated in
automatic partitioning [1] can be stated as follows:

(1) Two basic approaches are being studied.

a. An optimal partitioning approach based on an integer programming
model that produces partitions of DYNAMO code that take minimum time
to run on the network computer (taking into consideration the
communication overhead).

b. A heuristic approach that investigates different partitioning
policies that can be incorporated easily at compile time.

The tradeoff between the two approaches, together with comparative
studies for different heuristics, is the main theme of that study. The
study has reached the stage of completing the formulation of the
problem as a Mixed-Integer-Linear-Programming (MILP) model of a
reasonable size. Test runs using sample DYNAMO programs are being
attempted using a standard package (FMPS) for producing solutions on a
UNIVAC 11868 processor. In addition four heuristics have been
suggested and are being tested on the same sample programs. In all
these algorithms an important assumption has been made. Namely, every
two nodes N1, N2 representing the same variable V at different time
points are grouped together in one computer. This implies that a
variable is assigned to one processor during the whole simulation
period.

(2) In the MILP model a combination of synchronous and asynchronous
modes (see the section on Run Time Synchronization of Clusters) is
assumed. This assumption does not affect the resulting solution, but
mainly influences its optimality.

(3) The MILP model generates not only processor assignment, but also
the optimal starting-time values for each variable. This can be used
to generate the LAR looping sequence instead of the scheme described
above. The latter produces a feasible but not necessarily optimal
sequence.

The Symbol Table, the Sequencing Module, and the Partitioning Module
all generate data structures involved with connectivity between
statements which are related but not identical. The first two
constraints imposed on the design motivate these distributed data
structures. Some duplication does occur, but each data structure is
tailored specifically for the phase and is thus more efficient.
Moreover the phases are executing in parallel. A single central,
general data structure can only be accessed sequentially.


Error Message Generator 


The error message module is the last 
logical step in the pipeline. This 
module differs somewhat from the other 
modules in that there is more than one
input source. Error messages are 
received from any of the other modules 
except the code generator module. 


Each error message contains an error message number, a line number,
and, optionally, one or more identifier tokens and/or text tokens. The
error message number is used to retrieve an error message from disk.
The line number is printed with the error message to indicate where
the error occurred. Variable text, e.g., a subscript name, is simply
inserted where required. After the error message is formatted, it is
sent to a print program.

Error Recovery and Error Correction 


A compiler must be able to discover as many errors as possible before
terminating. This implies that a good error recovery scheme must be
incorporated in every good compiler.


When it comes to error correction, however, there is serious
conflict, particularly in regard to DYNAMO subscript errors. We
believe on principle that a compiler should not correct user errors.
This is a complicated process which may lead to unpredicted and
unreliable results. What is more, user errors may signal defects in
the model. We can serve the user better, we think, by flagging errors
and forcing him to correct them. On the other hand we also believed in
language standardization and we have tried to follow the DYNAMO manual
as closely as possible. According to the DYNAMO manual, all subscript
errors are considered nonfatal [5, p. 53-54].


Run Time Synchronization of Clusters 


A cycle of simulation consists of execution of LEVEL statements,
AUXILIARY statements if any, RATE statements and in certain specified
cycles SUPPLEMENTARY statements for PRINT and PLOT statements. Due to
dependency between statements, values may also be passed between
cycles.


The evocation of cycles and statements within a cycle can be
performed synchronously or asynchronously. By synchronously we mean
that statement execution or cycle initialization is evoked by a signal
given by a central process. In the asynchronous mode, no timing
mechanism exists to control the timing of evocation. Each process
evokes its logical successor.

There are a number of ways to synchronize partitions:


Option 1. (Synchronous Mode): Partitions are evoked by a global
signal at the beginning of a cycle and evoked to send and receive
messages at the end of the cycle. This mode stipulates that no
messages be passed between partitions in the same cycle. It implies
that values required for the execution of a statement are either
available at the beginning of the cycle or generated by the partition
itself. Only intercycle data dependency is taken care of. This mode
imposes a serious constraint on partitioning.


Option 2. (Synchronous Mode): An evocation
signal is provided for each class of
statements of the same type and a signal
in between classes for message passing.
This approach allows more freedom in
partitioning and message passing between
L-A, A-R, L-R, L-S, A-S, R-S pairs. The
price paid is additional signals and
reduction in speed. The execution time
of a cycle is the sum of maximum
execution times for each class of statements
plus the maximum transmission times.
Moreover a message cannot be sent once
the value of a quantity is available but
must wait for the synchronization signal.


Option 3. (Asynchronous Mode): In the
completely asynchronous mode, each
statement is executed once all the required
values on its right hand side are
available. It is conceivable that one
partition may run a number of cycles ahead of
another. Data messages may have to be
tagged by cycle numbers or a FIFO queue
is needed between partitions that
communicate with each other.


Option 4. (A Combination of Synchronous
and Asynchronous Modes): Cycles are
evoked by a global signal. The broadcast
message facility is used advantageously
for this purpose. Intracycle messages
are sent asynchronously. A partition may
send a value needed by another using the
point-to-point communication scheme. A
partition may also pause to wait for a
data message. Each partition may inform
the signaling mechanism of its readiness
to start a new cycle, which implies
completion of all execution and intercycle
data transfers. When all partitions are
ready, the signaling device generates a
signal for the new cycle. If n partitions
exist, then n messages plus one
broadcast are necessary. A broadcast
message from the signaling mechanism to
poll each partition's readiness is a more
efficient solution. But a good estimate
of the cycle execution time is important
to avoid multiple pollings.


The fourth option is favored for run 
time synchronization. 
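
As an illustration of the cycle-time expression given for Option 2, the following minimal sketch adds up the slowest partition in each statement class and the slowest transfer in each message-passing phase. The function name, the timing values, and the reading of "plus the maximum transmission times" as a sum over message-passing phases are assumptions made for this example; they are not part of the design described above.

```python
def option2_cycle_time(exec_times, transmit_times):
    """Estimate one simulation cycle under Option 2 (hypothetical helper).

    exec_times: dict mapping a statement class to the execution times of
                that class on each partition.
    transmit_times: list of message-passing phases, each a list of
                    transfer times observed in that phase.
    """
    # Every class ends with a global signal, so each class costs as much
    # as its slowest partition.
    compute = sum(max(times) for times in exec_times.values())
    # Assumption: each message-passing phase costs as much as its slowest
    # transfer; empty phases cost nothing.
    transmit = sum(max(times) for times in transmit_times if times)
    return compute + transmit


# Made-up timings for three partitions (arbitrary time units).
exec_times = {
    "LEVEL": [4, 6, 5],
    "AUXILIARY": [3, 2, 7],
    "RATE": [5, 5, 1],
}
transmit_times = [[1, 2], [2, 2], [3]]

print(option2_cycle_time(exec_times, transmit_times))
```

With these made-up numbers the slowest partitions contribute 6 + 7 + 5 units and the slowest transfers 2 + 2 + 3 units, which shows how a single slow partition in any class lengthens the whole cycle.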


Summary

We have described the design of a
pipelined DYNAMO compiler to be
implemented on a network computer. The goals
are to make use of parallelism available
both at compile time and run time. At
compile time the compiler itself is
organized in the form of a pipeline.
Each stage of the pipeline executes in
parallel and communicates asynchronously.
The object code is automatically
partitioned into clusters by the compiler so
that the clusters execute in parallel on
the constituent computers of the network
computer. The problems raised by the
objectives and the constraints of the
environment are discussed and alternative
solutions to these problems are examined.


J. Forrester,
A COMPARISON OF VARIOUS METHODS FOR DETECTING AND
UTILIZING PARALLELISM IN A SINGLE INSTRUCTION STREAM
Sudbury, Massachusetts


Abstract -- By analyzing the data depend- 
ency graph of a program it is possible to determine 
the potential for program speedups by simultaneous 
execution of logically independent operations. 
When concurrent execution of instructions in 
existing programs on a given machine is attempted, 
efficient detection of data independence during 
execution is a central difficulty. Simulation, 
using actual program traces, has been used to 
evaluate the effectiveness of several approaches 
to detecting the presence of logically
independent operations as a function of the number of
processing elements. The results indicate that
simple conflict detection algorithms perform
about as well as more complex detection algorithms
if the number of processing elements is six
or less. The complex algorithms continue to
show performance improvements as the number of
processing elements increases, whereas
performance levels off if the simple algorithms are
used. The rate of this increase indicates that
the additional improvement achievable probably 
does not justify the increased cost of the 
complex detection mechanisms and the additional 
processing elements. 


Introduction 


The idea that program speedups can be obtained 
by simultaneous execution of logically independ- 
ent instructions has received the attention of 
numerous researchers and practitioners [6,9,10, 
11]. While some authors have reported that 
utilizing potential parallelism can give 
program speedups of a factor of 50, computer 
manufacturers have settled for actual perfor- 
mance improvement in the 1.5-3 range. There 
are two main reasons for this: 


1) Theoretical work has tended to ignore 
the fact that the dependency graph, on 
which the more optimistic estimates are 
based, must be constructed during run time 
from a program stored linearly in main 
memory. If a computer utilizing the potential 
parallelism inherent in the dependency graph 
is to be cost effective, the hardware to 
detect the data dependencies present in the 
code must be fast, yet it cannot overshadow 
the multiple execution units in cost.

2) The problem of effectively handling
conditional branches has not been solved.
In pipelined machines, as described in M4
below, both the next sequential instruction
and the instruction branched to if the jump
is taken can be conveniently prefetched.
In machines with this type of architecture
the test to abort the inappropriate branch
can be made before any instructions along
that branch have reached a point where
recovery of the correct state is difficult.
The conditional execution of (several)
instructions along either path after a
branch, before the test can be resolved,
can lead to a large quantity of state
information; in fact the amount of state
information can grow exponentially
since the instructions along the paths
may themselves be branches.


Many researchers have felt that the conditional 
branch problem is the main reason that the 
potential parallelism in code is not better 
utilized. In this paper we analyze the effects 
of the problem raised in point one: How much 
parallelism is actually present in existing 
code, and how much does the technique used to 
detect and utilize this parallelism degrade 
performance from the ideal? 


In the next section we shall review the 
theory associated with concurrent execution of 
logically independent instructions, pointing 
out a number of problems and subtleties not 
previously noted in the literature. After that 
we will present several machine models which 
detect and utilize parallelism in a single in- 
struction stream in different ways. Some of 
the models embody the theoretically ideal data 
dependency detection mechanism, while still 
incorporating the realistic limitations of 
non-zero instruction decode time and main 
memory fetch time and an addressing structure 
similar to those found on many current computers. 
An empirical upper limit on performance improve- 
ment can be obtained for a given piece of object 
code by executing it on a (simulated) machine 
employing the ideal data dependency mechanism. 
Other of the models are based on data depend- 
ency detection mechanisms that do not fully 
exploit the parallelism inherent in the code, 
and thus are not as complex to implement. In 
the final section of the paper, simulation re- 
sults and an interpretation will be presented. 


Theory of Concurrent Instruction Execution 


The abstract theory of program speedup by 
concurrent execution of logically independent 
instructions is well documented [5, 6, 9, 10].
In fact by appropriate interpretation the theory 
can be applied at a number of levels. 
Definition: Let $T_1, T_2, \ldots, T_n$ be a sequence of
elemental operations, each with a well defined
set of input variables and output variables.
We define an ordering relation $\prec$ on the elemental
operations as follows:

$T_i \prec T_j$ if and only if $i < j$ and at least one of
the following three conditions holds:

(i) an input variable of $T_j$ is an output variable
of $T_i$,

(ii) an input variable of $T_i$ is an output variable
of $T_j$,

(iii) $T_i$ and $T_j$ have an output variable in common.

The transitive closure of $\prec$ defines a partial
ordering, $<$, on the set of elemental operations.
From this definition it is possible to construct
a data dependency graph (see Figure 1). It is
customary to include only those arcs that cannot
be deduced by transitivity. If execution times
are associated with each node of the graph, we
have the following:


Principle of Optimality: Given unlimited
resources, the minimal execution time of a program,
sequentially specified as $T_1, T_2, \ldots, T_n$, is
equal to the length of the longest path in the
dependency graph (the length equals the sum of the
execution times of the nodes along the path),
and this minimal execution time can be realized
by starting an elemental operation as soon
as all its predecessors (in the partial ordering)
have completed.
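
The ordering relation and the Principle of Optimality can be illustrated with a short sketch. It assumes each elemental operation is given as a pair of input and output variable sets; the function names, the four-operation program, and its execution times are invented for the example. Because every arc runs from an earlier to a later operation, the sequential order is already a topological order, so finish times can be computed in a single pass.

```python
from itertools import combinations


def directly_ordered(earlier, later):
    """Return True if `later` must wait for `earlier`.

    Each operation is a pair (inputs, outputs) of sets of variable names,
    and `earlier` precedes `later` in the sequential program.
    """
    ins_e, outs_e = earlier
    ins_l, outs_l = later
    flow = bool(ins_l & outs_e)     # (i) later reads what earlier writes
    anti = bool(ins_e & outs_l)     # (ii) later overwrites what earlier reads
    output = bool(outs_e & outs_l)  # (iii) both write the same variable
    return flow or anti or output


def dependency_arcs(ops):
    """All arcs (i, j) with i < j; arcs implied by transitivity are kept."""
    return [(i, j) for i, j in combinations(range(len(ops)), 2)
            if directly_ordered(ops[i], ops[j])]


def minimal_execution_time(times, arcs):
    """Length of the longest path, weighting each node by its time."""
    finish = [0] * len(times)
    for j in range(len(times)):
        # An operation starts as soon as all of its predecessors finish.
        start = max((finish[i] for i, k in arcs if k == j), default=0)
        finish[j] = start + times[j]
    return max(finish, default=0)


# Hypothetical program: a = b + c; d = a * 2; b = 5; e = d + b
ops = [
    ({"b", "c"}, {"a"}),
    ({"a"}, {"d"}),
    (set(), {"b"}),
    ({"d", "b"}, {"e"}),
]
times = [2, 3, 1, 2]

arcs = dependency_arcs(ops)
print(arcs)
print(minimal_execution_time(times, arcs), "versus", sum(times), "sequentially")
```

In this example the third operation depends on the first only through condition (ii), since it overwrites a variable the first operation reads.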


This model has been specialized in a number
of ways. Graham [4] and Coffman [2] have
interpreted "elemental operation" as a job and
have considered scheduling interrelated jobs on a
multiprocessor system (with limited resources).
Brinch-Hansen [1] has treated "elemental
operation" as a procedure or begin block, allowing
the user to specify parallelism in a higher
level language. At the other end of the spectrum,
Tsuchiya and Gonzalez [12] have performed automatic
optimization of horizontal microcode within the
constraints imposed by the dependency graph that
results from considering as "elemental operations"
logically indivisible sub-instruction functions.


In the research reported below we will be
adapting this abstract model to execution of the
instructions in a single user program. A number
of subtleties arise in this case. It should be
noted that the points presented below can be
incorporated into the abstract model, by either
slightly modifying the definition of the ordering
relation, or by carefully defining the input
and output variables. As with many other simulation
problems the difficulty in building an accurate
model is determining what aspects of the problem
are the most relevant.


The first class of subtleties deals with why 
the principle of optimality does not really produce 
the minimum possible execution time. 


1) Changing the elemental operations. By 
changing the choice of elemental operations 


the total execution time may be reduced. 
Formally we have 


Definition: A refinement of a sequence of
elemental operations, $T_1, T_2, \ldots, T_n$, is a sequence
$T_{11}, T_{12}, \ldots, T_{1m_1}, T_{21}, T_{22}, \ldots, T_{2m_2}, \ldots,
T_{n1}, T_{n2}, \ldots, T_{nm_n}$ such that

(i) for $i \neq j$, if $T_{iu} < T_{jv}$ then $T_i < T_j$,

(ii) for $i \neq j$, if $T_i < T_j$ then there exist
$u$ and $v$ with $T_{iu} < T_{jv}$, and

(iii) the longest path (where the length of a
path is defined to be the sum of the
execution times associated with the nodes
along that path) in the subgraph
$T_{i1}, T_{i2}, \ldots, T_{im_i}$ equals the execution time for
node $T_i$, for all $i$, i.e. the execution
time for the subgraph into which $T_i$ is
decomposed equals the execution time of
$T_i$ when viewed as a whole.


It is not difficult to prove that the minimal
execution time of a refinement is less than or
equal to the minimal execution time of the
original sequence. Intuitively, the subsequence
$T_{i1}, T_{i2}, \ldots, T_{im_i}$ performs the same task as $T_i$.
This condition can also be formally stated, but a
precise statement of this condition is not
important here.


The notion of refinement is relevant to the
current discussion since there are two natural
choices for elemental operations: machine
instructions, like load accumulator number five from
the main memory location symbolically labeled I
(L A5,I), or subinstruction functions, like compute
an address, fetch an operand, etc. Figure 2 shows
the same program segment as Figure 1, but with a
different choice of elemental operations. Because
of the environment within a computer, detecting
dependencies at the subinstruction level is not
more difficult than at the machine instruction
level. In the simulation results reported later
subinstruction functions are used as the elemental
operations.


2) Restrictive instruction format. Even if 
a computer contains an unlimited supply of 
arithmetic-logic units, a rigid instruction 
format or lack of a sufficient number of general 
registers may introduce dependencies in the machine 
code not implied by the higher level language 
statement of the program. Inefficient use of 
general registers or poor code generation by a 
compiler can also create such dependencies. 
Dependencies introduced for these reasons normally 
manifest themselves as dependencies due to 
conditions (ii) and (iii) of the basic definition 
of the ordering relation. The arcs marked with
asterisks in Figures 1 and 2 represent such
dependencies. Keller [5] discusses the technique
of "virtual registers" which can be used to
eliminate these dependencies, and thereby
(potentially) reduce the minimal execution time.
When viewed theoretically, the technique amounts 
to having an infinite number of input/output 
variables available for use with the elemental 
operations, and using each variable for output 
only once (it can subsequently be used for input 
indefinitely.) Practical implementation of the 
virtual register technique may be quite costly 
and the necessarily non-zero time to use the 
additional hardware may negate any expected per- 
formance improvement. Careful examination of 
code from machines with numerous general registers 
and register-memory and register-register in- 
structions (UNIVAC 1100 series and IBM 360 
series are typical) indicates that careful register 
allocation makes the potential gain from the use 
of virtual registers quite small in most real 
applications. At the other extreme, in machines 
with one accumulator this problem is so severe 
that almost no program speedup is possible 
without using virtual registers. 


3) Alternate program formulations. The
sequence $T_1, T_2, \ldots, T_n$ may be able to be replaced
by another sequence $W_1, W_2, \ldots, W_m$, which
accomplishes the same job. Kuck [6] and Lamport [7]
have investigated speedups obtainable by semantic
analysis of FORTRAN programs. Careful analysis
of the algorithm being employed, with subsequent
recasting to take advantage of vector/array
features of the hardware can produce dramatic
improvements. A recent paper by Giroux [3] reported
a speedup factor of 25 for code carefully reworked
for the CDC-STAR. The reprogramming effort took
several years, however. Such techniques will not
be investigated here.


The points mentioned above demonstrate that
unless precautions are taken, determining
potential program speedups from the dependency
graph of a program can yield results that
are too low. There are also a number of
precautions that must be taken to avoid overly
optimistic estimates of performance improvement.
We discuss one here, since it is relatively
abstract in nature; others are discussed in the
next section.


4) Unresolved addresses. This problem is 
not apparent if the elemental operations are 
machine language instructions, and is easily 
overlooked if the dependency graph is generated 
from an assembly language listing containing 
symbolic addresses. On many third generation 
computers absolute addresses are not part of 
the instruction, but are computed during run 
time by adding a displacement contained within
the instruction to the contents of a base 
register and, possibly, the contents of an 
index register. This addressing scheme permits, 
amongst other things, shorter instructions, 
greater run time flexibility in managing 
storage and makes array referencing more natural. 
Run time computation of absolute addresses poses 
a significant hazard for the potential for 
concurrent execution however. Consider the 
symbolic code segment: 


The fetch from memory location J cannot be
safely initiated until it is known that there are
no uncompleted (including uninitiated) stores to
memory location J preceding this instruction.
This implies that the real address symbolically
represented by I will have to be computed and
this address and the real address symbolically
represented by J will have to be compared for
possible conflict before the fetch can be safely
initiated. Note carefully, we have a dependency
between computing the real address of I and
fetching from the memory location addressed by J,
NOT between the actual store into I and the fetch
from J. In fact no fetches or stores past the
S A0,I instruction can be safely initiated until
after the address of I is known. In the situation
just described, where the base register is implied
(and not modified frequently), and no indexing
is performed, no delay will actually occur. The
reason for this is that by the time the real address
symbolically represented by J is known, so the
fetch could be initiated, the real address
symbolically represented by I will also be known.
However, if a store is being made into a location
whose address is computed using an index register,
like S A1,A(X0), then the computation of the
address symbolically represented by A(X0) can be
delayed a long time due to a dependency on X0.
Thus every fetch past the S A1,A(X0) will be
(indirectly) delayed, waiting for an earlier
load index register instruction to complete,
even though there may be no conflicts over
actual data. There is no a priori way of
determining how much this indirect effect will lower
the potential for program speedup. In practice
this problem arises naturally in two ways:

1) in scientific code, where indexing is a common 
occurrence, and 2) in the object code of programs 
written in ALGOL-like languages, where an index 
register is used as a base register to address 
variables whose scope is global to the currently 
active block. To the best of the author's know- 
ledge nobody has investigated the effects of this 
problem on potential program speedup. 
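
The virtual register technique described under point 2 of this section can be sketched as a renaming pass over the instruction sequence: every write receives a fresh name, and every read is redirected to the most recent name of the variable it uses. This is only an illustration of the idea, not Keller's mechanism; the function name, the `v1`, `v2`, ... naming scheme, and the four-instruction example are assumptions.

```python
def rename_to_virtual_registers(ops):
    """Give every written variable a fresh virtual name (hypothetical helper).

    ops: list of (inputs, outputs), each a tuple of variable names, in
         sequential program order.
    Returns the renamed list; each virtual name is written exactly once.
    """
    current = {}   # latest virtual name for each original variable
    counter = 0
    renamed = []
    for ins, outs in ops:
        # Reads use the latest version; never-written names pass through.
        new_ins = tuple(current.get(name, name) for name in ins)
        new_outs = []
        for name in outs:
            counter += 1
            current[name] = f"v{counter}"
            new_outs.append(current[name])
        renamed.append((new_ins, tuple(new_outs)))
    return renamed


# Hypothetical code in which register A0 is reused by the third instruction.
ops = [
    (("I", "J"), ("A0",)),    # A0 = I + J
    (("A0", "K"), ("A1",)),   # A1 = A0 * K
    (("L", "M"), ("A0",)),    # A0 = L + M   (reuses A0)
    (("A0", "A1"), ("A2",)),  # A2 = A0 + A1
]

for before, after in zip(ops, rename_to_virtual_registers(ops)):
    print(before, "->", after)
```

After renaming, the third instruction writes `v3` rather than `A0`, so it no longer has to wait for the first two instructions; only the dependencies on values actually produced and consumed remain.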


Machine Models 


From the points discussed in the last 
section, determination of the potential for 
performance improvement by concurrent execution 
of logically independent elemental operations must 
be done with care if the results are to have 
credibility. A major additional problem remains 
when we consider building a computer to utilize all 
or some of this potential. How can the parallelism 
inherently present in a single instruction stream 
be efficiently utilized? The importance of this 
question cannot be stressed too much, since the 
data dependency detection mechanism creates an 
overhead cost in addition to the increased 
cost due to the presence of the multiple arithmetic- 
logic units needed to perform the computation. The 
(non-zero) time the detection mechanism takes to 
function must also be considered. 


Four machine models were developed and a
simulation was performed to determine how much
of the potential program speedup could be
realized by each one. The models differ primarily
in the way they detect parallelism and in how
they utilize the parallelism once found. The
philosophy has been to determine the maximum
possible potential parallelism within the constraints
of each model. Toward this end the following
properties are common to all four models.


1) The elemental operations are subinstruction
functions. These include compute an
address (see the discussion below), fetch/
store from/to memory or a register, perform
the basic algebraic or logical operation, and
fetch and decode an instruction. The time
it takes to perform each elemental operation
is a parameter of the simulation. The rate
at which instructions stored sequentially can
be fetched can be set faster than main
memory speeds, effectively simulating a
high speed instruction buffer.


2) The virtual register technique is used
to eliminate data dependencies arising from
conditions (ii) and (iii) of the
definition of the ordering relation.


3) Memory bandwidth is assumed adequate to
handle the requests generated. Delays due
to memory bank conflicts or cache misses
(if memory cycle time is set sufficiently
low as to imply a cache memory is being
used) are ignored.


4) When a conditional branch is encountered 
the correct path is traversed, even before 
the test is completed. The simulator can 
perform in this manner because actual program 
traces are utilized, so the next instruction 
actually executed is available. This attitude 
is essentially the one taken by Riseman and 
Foster [9] in their earlier simulation 
experiments. Some factors which negate the 
clearly too optimistic nature of this 
approach are discussed below. 


The goal of achieving the maximum possible 
potential parallelism must be tempered by 
reasonable constraints if the models developed 
are to produce meaningful results. The following 
properties, which are restrictive in nature, are 
common to all the models.


5) The addressing structure of the
underlying instruction set reflects that used
on a number of widely available machines. 
In all four models the final memory address 
is computed by adding a displacement to a 
base register and to an optional index 
register. As was discussed in the last 
section the computation of final addresses 
during execution impacts potential performance 
improvement by indirectly inhibiting all 
future fetches. 


6) The too optimistic estimates developed 
by assuming foreknowledge of the way in 
which the test in a conditional branch will 
resolve is ameliorated by two constraints 
imposed on the models: 


- The time between recognizing that an 
instruction is a conditional branch 
and the decoding of the next instruction 
to execute is greater when the branch is 
taken than when the next instruction to 
execute is fetched from the next sequential 
location. This reflects the fact that
the "jump to" address must be computed
before the instruction can be fetched. 


- No stores into registers or memory are 
permitted after a conditional branch, 
until the test is made. The philosophy 
here is that no irrevocable actions 
should be taken until it is guaranteed 
that they will occur. 


These two points, coupled with a non-zero 
instruction fetch time and the observed 
average distance between successive 
conditional branches in actual code, help to 
explain why the speedup factor of 50 reported 
by Riseman and Foster [9] is not confirmed

by this research. 


7) The amount of hardware that is dedicated 
to performing instruction execution (as 
opposed to hardware for detecting data 
dependencies) can be limited.

Definition: Suppose the instructions in the
program trace are numbered consecutively (in
the order they executed), and suppose a
number N, known as the window size, is
given. An instruction numbered i is
called active if and only if instructions
1, 2, ..., i-N have completed. In other
words, the first uncompleted instruction
and the N-1 instructions following the
first uncompleted instruction are active.


Note that an active instruction need not 
actually be making forward progress; 

all of its uncompleted subinstruction 
functions may be waiting on data depend- 
encies. The window size is a parameter 

of the simulation. When the window size 
is set to one all four models reduce to 

a common denominator--a computer which 
executes one instruction at a time, using 
parallelism within the instruction, and 
overlapping instruction fetch and decode of
the next instruction with execution of 

the present instruction. The simulated 
execution time on this machine is used as 
the basis for computing program speedups 
when several processors are employed. 

This is a realistic model for both speedup 
and costing estimates, for it corresponds 
to many present day medium scale machines. 
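
The window definition can be made concrete with a small sketch. It assumes the trace is represented as a list of completion flags in execution order and uses 0-based positions instead of the 1-based numbering of the definition; the function name and the sample trace are hypothetical.

```python
def active_instructions(completed, window_size):
    """Positions (0-based) of the active instructions in a trace.

    completed: list of booleans in trace order, True where the
               instruction has completed.
    window_size: the window size N from the definition.
    """
    if False not in completed:
        return []  # nothing left to execute
    first_uncompleted = completed.index(False)
    last = min(first_uncompleted + window_size, len(completed))
    # The window may contain instructions that already completed out of
    # order; they still occupy a slot until the window moves past them.
    return list(range(first_uncompleted, last))


# Hypothetical trace: positions 0, 1 and 3 have completed.
trace = [True, True, False, True, False, False, False]

print(active_instructions(trace, 3))
print(active_instructions(trace, 1))
```

With a window size of one the window holds only the first uncompleted instruction, which corresponds to the one-instruction-at-a-time baseline used for computing speedups.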


Within the constraints listed above, it is 
still possible to have wide variation. Determining 
which elemental operations can be started at any 
time is quite complex. To be valid a data depend- 
ency detection mechanism must not start an opera- 
tion in violation of the partial ordering imposed 
by the dependency graph; it need not, however,
start every operation that logically can begin.

In order to keep hardware costs within bounds 
the designer of a computer may choose to use a 
data dependency detection mechanism which does 
not utilize the full potential for concurrent 
execution. 


Theoretical Ideal - M1


The purpose of this model is to establish 
an upper bound on possible performance improve- 
ment. For active instructions, subinstruction 
functions are started as soon as the necessary 
input data is available. In particular, depend- 
encies created by the addressing structure are 
ignored, and only dependencies on actual data
are considered. In terms of the addressing
structure used in the simulation, displacement
plus contents of a base register plus (optional)
contents of an index register, the dependencies
caused by unresolved addresses, as discussed in
the last section, are ignored. This model not
only establishes an empirical upper limit, it
also allows us to gauge the effect of two
measures designed to increase parallelism
potential:


- Fetch operands before the possibility of
conflict over data is resolved and restart
an instruction if a data dependency is
later discovered.

- Increase the word length, allowing program-
mers and compilers to use the added bits 
to contain data dependency information 
derivable from the symbolic (assembly) 
listing, but lost in translating into 
absolute machine code. 


Fully Parallel Computer - M2


The data dependency detection mechanism is quite similar
to that used in M1; a subinstruction function
for an active instruction is begun as soon as it
can logically be started. The difference is in
the way dependencies due to the addressing
structure are treated. In M2 the (indirect)
dependencies are not ignored as in M1. Since
elemental operations can proceed in an order
quite different from that implied by the sequential
program statement, the name "fully parallel"
is justified. It is very important to note the
complexity of the algorithms used to detect data
independence in M1 and M2. Since any active
instruction can be dependent on any other active instruction
which precedes it in the sequential program
statement, if the active instruction window is
of size N, the number of comparisons which must
be made to determine all startable elemental
operations is $O(N^2)$. Thus the overhead cost of
the detection mechanism grows at a faster rate
than the hardware performing the actual instruction
execution.
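
The quadratic growth can be seen from a simple count. The sketch below assumes one comparison for each pair consisting of an active instruction and an earlier active instruction; in a real detector each pair would involve several operand and address comparisons, so only the growth rate is meaningful. The function name is hypothetical.

```python
def pairwise_comparisons(window_size):
    """Number of (earlier, later) pairs among `window_size` active instructions."""
    return window_size * (window_size - 1) // 2


for n in (1, 2, 4, 8, 16):
    # Execution hardware grows roughly with n; the comparison count
    # grows roughly with n squared.
    print(n, pairwise_comparisons(n))
```

Doubling the window from 8 to 16 doubles the number of instructions that can be in execution but raises the pair count from 28 to 120.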


Concurrent Execution/Sequential Detection - M3


One way to prevent the overhead cost of the 
detection mechanism from growing at a rate 
faster than the rate of growth of the arithmetic- 
logic units is to use a data dependency detection 
mechanism that processes requests to fetch or 
store an operand in a sequential manner. In
model M3 the data dependency mechanism
provisionally approves a fetch or store if


(1) the address of the fetch or store is 
known and 

(2) there is at most one dependency which 

prevents complete approval of the request, 

(e.g. the only reason a store to an 

address cannot be approved is that there 

is a fetch from the same address). 


The arithmetic-logic units in model M3 are
assumed to be sufficiently complex that they can 
detect resolution of the one dependency. If the 
address of the fetch or store is not known or there 
are several dependencies then the request is held 
up, as are all requests following this request. 
When the request can be provisionally approved, 
processing of requests resumes. In addition to 
the general strategy just described, if an 
instruction fetches a value from a location and 
later stores an (updated) value back into 
the same location this is not counted as a 
dependency. This is perfectly safe since 
no value to store can be computed until 


after the fetch is complete. This apparently
minor addition is necessary if two address machines
are not to be unduly penalized by their addressing
structures. (Note the earlier discussion of overly
restrictive instruction formats.) The actual
execution of instructions goes on concurrently in
multiple execution units in M3.


Mechanisms of the type described here have been
used on a number of computers. Certainly the most
widely known is the scoreboard on the CDC-6600 [10].


Overlapped Computer - M4


The philosophy behind this design is that the 
execution of an instruction can be divided into 
phases, which the instruction progresses through 
sequentially. On code with no dependencies an 
instruction is in phase N, the next sequentially 
specified instruction is in phase N-1, the next 
sequentially specified instruction is in phase N-2, 
etc., and at the end of every machine cycle all 
instructions advance to the next phase. N is 
called the degree of overlap. Checking for
dependencies is quite simple, since the activity in
each phase is prescribed. Uneven execution times
and data dependencies cause delays by not permitting
an instruction, and all instructions that follow it,
to advance to their next phase. The arithmetic-
logic unit can be divided into several phases
(pipelined) if this seems appropriate. Even many
modest computers are overlapped to some degree:
instruction fetch and decode is overlapped with
instruction execution. In this simple case the only
dependencies possible are due to self modifying 
code (prefetched instruction no longer correct) 
or branches taken (wrong instruction prefetched). 
Normally a delay is encountered while the correct 
instruction is refetched. The UNIVAC 1100/80 
and IBM 370/168 are examples of machines in which 
a high degree of overlap is used to gain 
significant speedups. 
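
A rough timing estimate for the overlapped model follows from the description above. The sketch assumes that every phase takes exactly one machine cycle and that each delay holds the stalled instruction and all its successors for one cycle; both are simplifications, and the function names are hypothetical.

```python
def overlapped_cycles(num_instructions, degree_of_overlap, stall_cycles=0):
    """Machine cycles needed to finish a run of instructions.

    The first instruction needs `degree_of_overlap` cycles to pass through
    every phase; with no delays each later instruction finishes one cycle
    after its predecessor. Each stall cycle delays everything behind it.
    """
    if num_instructions == 0:
        return 0
    return degree_of_overlap + (num_instructions - 1) + stall_cycles


def non_overlapped_cycles(num_instructions, degree_of_overlap):
    """Same phases, but one instruction at a time."""
    return num_instructions * degree_of_overlap


# 100 instructions through 4 phases, first without and then with delays.
print(overlapped_cycles(100, 4), "versus", non_overlapped_cycles(100, 4))
print(overlapped_cycles(100, 4, stall_cycles=30))
```

Under these assumptions the speedup approaches the degree of overlap for long dependency-free runs, and every stall cycle is paid in full because all following instructions are held back.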


Results and Conclusions 


A simulator for the models described in the 
preceding section has been developed and run on 
a number of benchmarks. The benchmarks, written 
in FORTRAN and executed on a UNIVAC 1106 available
to the staff at the Sperry Research Center, were 
chosen from programs currently being used in un- 
related research areas. The programs studied in 
depth were a constrained minimization problem with 
integer variables, a model of a physical problem 
which uses double precision floating point instruct- 
ions, and a system print routine. The simulator 
uses program traces generated by actual program 
executions. The results of a number of runs are 
summarized in the graphs at the end of the paper. 
(Figures 3-5). The horizontal axis represents 
the number of active instructions. The vertical 
axis represents the potential speedup factor 
due to concurrent execution of logically independent 
elemental operations. The speedup factor is
computed relative to a computer (with the same
component speeds) that executes instructions in a
strictly sequential manner, except that instruction
prefetch is performed.


The Effect of Sophisticated Addressing Structures 
on Potential Performance Improvement 


The need to be concerned about the indirect 
effects of unresolved addresses has been mentioned 
in several places in this paper. There was no 
a priori way of determining the degree to which 
these dependencies would affect performance. 
Comparing the simulation results of M1 and M2 
allows us to conclude that this problem is relatively 
minor. While the reason the performance 
degradation is so small is not completely understood, 
several factors appear to contribute: 


- index registers are frequently loaded 
in advance of need, so no delay is 
encountered when an address using index- 
ing is computed. 


- when a fetch is delayed in M2 because of 
the presence of an unresolved address 
it is often simultaneously dependent on 
the actual data as well. This dependence 
on the data causes a delay in M1. (This 
was discovered by monitoring some of the 
queues internal to the simulator.) 


Several consequences of the small nature of 
the performance degradation should be noted. 
The two techniques for improving performance 
discussed with model M2, recording within the 
instruction information about data dependencies 
derivable at compile time and prefetch of data 
with refetching if a dependency is found, do not 
provide noticeable improvement in speedup 
potential. An important conclusion can be 
drawn in the area of system security by naturally 
extending these results. The use of numerous 
base registers to allow implementation of small 
segments and rapid segment switching does not 
appear antithetical to the goals of using 
parallelism to gain high performance. The reason 
for this is that while changing the contents 
of a base register is done more frequently in an 
environment that supports flexible use of small 
segments, it is not done as often as loading or 
modifying an index register, and thus has little 
additional impact. 


Instruction Starvation and Differential Execution 
Time 


The continued gradual rise in the potential 
performance of M1 and M2 as a function of the 
number of processors, after the performance of 
the simpler designs of M3 and M4 has leveled off, 
might make these models attractive to those users demanding 
the highest possible speeds. The sensitivity of 
the curve of performance versus number of active 
instructions to changes in the relative speeds of 
some of the basic operations becomes an interesting 
question. Two experiments were performed. In one 
the ratio of sequential instruction fetch 
time to data fetch time was varied. In the other 
the ratio of the time to perform a long instruction 
(e.g. double precision floating point multiply) 
to that of a short instruction (e.g. add a fixed 
point value to a register) was varied. The results 
are summarized in Figures 6 and 7. 


Figure 6 shows that unless a very fast in- 
struction buffer is provided the performance im- 
provement that can be expected from use of My 
is limited due to instruction starvation. When 
the observed average instruction execution time 
is less than the rate at which instructions 
can be supplied, performance will level off at 
about the number of processors needed to main- 
tain this average instruction execution time. 
The leveling off is gradual, of course, because 
the amount of parallelism in the instruction 
stream varies. 
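
A rough sketch of the leveling-off effect described above, under strong simplifying assumptions that are not part of the simulator: instructions are supplied at one per `fetch_time`, each takes `avg_exec_time` to execute, and the instruction stream offers perfect parallelism. The function name and parameters are hypothetical.

```python
def starvation_limited_speedup(n_active, avg_exec_time, fetch_time):
    """Upper bound on speedup when instruction supply is the bottleneck."""
    # The sequential baseline (with prefetch) completes one instruction
    # every max(avg_exec_time, fetch_time); no machine can complete
    # instructions faster than one per fetch_time.
    supply_cap = max(avg_exec_time, fetch_time) / fetch_time
    return min(n_active, supply_cap)

for n in (1, 2, 3, 4, 8):
    print(n, starvation_limited_speedup(n, avg_exec_time=12, fetch_time=4))
```

In this idealized model the curve is a sharp knee at `avg_exec_time / fetch_time` active instructions; in the measured curves the knee is rounded because the available parallelism varies along the instruction stream.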


Figure 7 indicates that obtaining large 
speedups is difficult for those users who desire 
it the most, users with large simulation programs 
that use many double precision floating point 
operations (e.g. weather prediction programs). 
When an instruction with a long execution time 
falls on the critical path of the dependency 
graph, the activities that can be performed 
simultaneously with the execution of this 
instruction are soon completed and the system 
must wait for the instruction to complete. 
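
The effect of a long instruction on the critical path can be stated as a simple bound. This is a worked illustration rather than a result from the simulation; the symbols T_seq (sequential time of a program section), T_par (its parallel time) and L (execution time of the long instruction) and the numbers in the comment are assumptions introduced here.

```latex
S \;=\; \frac{T_{\mathrm{seq}}}{T_{\mathrm{par}}}
  \;\le\; \frac{T_{\mathrm{seq}}}{L},
\qquad \text{since } T_{\mathrm{par}} \ge L .
% Illustrative numbers (assumed): T_seq = 100, L = 40 gives S <= 2.5,
% no matter how many active instructions are allowed.
```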


Practical Data Dependency Detection Mechanisms 


The evaluation of the effectiveness of 
practical mechanisms for detecting subinstruction 
independence was a major goal of this research. 
The graphs at the end of the paper show that 
sequential (provisional) approval of fetches and 
stores, as described in machine model M3, and 
the overlapped approach, as described in model M4, 
limit the amount of parallelism that can be 
utilized. In order to validate the performance 
improvement recorded for model M3, the author 
investigated the experiences of users who have run 
large FORTRAN codes on both CDC-6600 and CDC-6400. 
The two machines have the same instruction set. 
The CDC-6600 corresponds closely to model M3, 
while the CDC-6400 corresponds to the base for 
comparison, a computer with only one active 
instruction. After taking into account the 
differences in component speeds, the performance 
improvement observed for the CDC-6600 over the 
CDC-6400 is similar to that obtained in this 
simulation [8]. The widening gap between M, and 
Mea - Mg,as the number of active instructions 
increases beyond six, indicates that it is 
difficult to find algorithms to detect parallelism 
while keeping the cost of detection hardware 
down. 


It appears that for all but specialized 
scientific problems that can be formulated 
in terms of vector or array operations, cost 
effective program speedups by architectural 
techniques cannot be pressed much beyond 
current implementations. Some reasons for this are: 


- The cost of the mechanism for detecting 
dependencies used in M, would be much 
greater than that used in M3 or M4, 
if it could be designed at all. 


- The regularity introduced into M3 and M4 
by the natural sequencing of certain 
elemental operations permits more even 
utilization of memory resources, keeping 
down the cost of the memory interface. 
It appears that more realistic modeling 
of the memory interface would impact M 
more than M3 or M4. 


- The benefits of M, over Mg or Mg are most 
significant when the number of active 
instructions is large. As the number of 
active instructions grows the problem 
of handling conditional branches 
becomes more acute. 


- The sensitivity of Mj, to a number of 
factors (e.g. sequential instruction 
fetch times and execution time for 
long instructions) implies that a 
computer developed around model M 
might very well show performance 
lower than anticipated from the sim- 
ulation results. 


- The rise of the potential performance 
curve for M, is gradual after six 
processors. The slow rate of increase 
means a marginal return for each 
processor added (at more or less 
constant cost). 


It is not possible to conclude whether 
the M3 or the M4 design is more cost effective 
in general. The effects of minor variations in 
the addressing structure, the instruction set and 
the job mix imply that a detailed analysis is 
needed in each individual case. 


References 


(1) Brinch-Hansen, P., "The Programming Language 
Concurrent Pascal", IEEE Transactions on 
Software Engineering, SE-1, no. 2, June 1975, 
pp. 199-207. 

(2) Coffman, E.G. Jr. and R.L. Graham, "Optimal 
Scheduling for Two Processor Systems", 
Acta Informatica, 1972, pp. 200-213. 

(3) Giroux, E.D., "A Large Mathematical Model 
Implementation on the Star-100 Computers", 
presented at the Symposium on High Speed 
Computer and Algorithm Organization 
(proceedings to appear). 

(4) Graham, R.L., "Bounds on Multiprocessing 
Timing Anomalies", SIAM Journal of Applied 
Mathematics, Vol. 17, no. 2, March 1969. 

(5) Keller, R.M., "Look-Ahead Processors", 
Computing Surveys, Vol. 7, no. 4, December 
1975, pp. 177-195. 

(6) Kuck, D.J., Y. Muraoka, and S.C. Chen, 
"On the Number of Operations Simultaneously 
Executable in FORTRAN-like Programs and their 
Resulting Speedup", IEEE Transactions on 
Computers, C-21, no. 12, December 1972, 
pp. 1293-1309. 

(7) Lamport, L., "The Parallel Execution of DO 
Loops", CACM, Vol. 17, no. 2, February 1974, 
pp. 83-93. 

(8) Link, B., Sandia Laboratories. Private 
Communication. 

(9) Riseman, E.M. and C.C. Foster, "The 
Inhibition of Potential Parallelism by 
Conditional Jumps", IEEE Transactions 
on Computers, C-21, no. 12, December 1972, 
pp. 1405-1411. 

(10) Thornton, J.E., Design of a Computer -- 
The Control Data 6600, Scott, Foresman 
and Company, Glenview, Illinois, 1970. 

(11) Tjaden, G.S. and M.J. Flynn, "Detection 
and Parallel Execution of Independent 
Instructions", IEEE Transactions on 
Computers, C-19, no. 10, October 1970, 
pp. 889-895. 

(12) Tsuchiya, M. and M.J. Gonzalez, "Toward 
Optimization of Horizontal Microprograms", 
IEEE Transactions on Computers, C-25, 
no. 10, October 1976, pp. 992-999. 


Dependency graph nodes (instruction, execution time): 

  L   A0,I       (11 time units)     LX  X0,J      (11 time units) 
  ADD A0,#1      (7 time units)      L   A1,A(X0)  (12 time units) 
  S   A0,I                           ADD A1,A0     (8 time units) 
  ADD A0,#-100   (7 time units)      S   A1,A(X0)  (11 time units) 
  JNP A0,$99     (5 time units) 

speedup = 1.91 

FORTRAN                  Sample Machine Code     Key 

I=I+1                    (1) L   A0,I            L = load accumulator 
                         (2) ADD A0,#1           ADD = fixed point addition 
                         (3) S   A0,I            S = store accumulator 
A(J)=A(J)+I              (4) LX  X0,J            LX = load index register 
                         (5) L   A1,A(X0)        JNP = jump non-positive 
                         (6) ADD A1,A0           #i = immediate operand 
                         (7) S   A1,A(X0)        $99 = label 
IF (I.LE.100)GOTO 99     (8) ADD A0,#-100        Ai = accumulator i 
                         (9) JNP A0,$99          Xi = index register i 
                                                 A(Xi) = address computation using indexing 
(array A is of type INTEGER) 


FIG. 1 A Typical Dependency Graph 


Graph nodes recoverable from the figure: F X0; CA A(X0); F A(X0); S A1 

speedup = 2.41 

CA = compute address (4 time units) 
F = fetch (6 time units from memory, 
1 time unit from register) 
S = store (6 time units to memory, 
1 time unit to register) 
Im = prepare immediate operand (4 time units) 
Add = (2 time units) 
Test = (1 time unit) 
LPC = load program counter (1 time unit) 


FIG. 2 A Refinement of a Dependency Graph 

FIG. 3 Parallelism Potential: Constrained Minimization Problem 

FIG. 4 Parallelism Potential: Simulation of Electron Scattering 

FIG. 5 Parallelism Potential: FORTRAN Print Routine 


[Plot: speedup factor (vertical axis) versus number of active 
instructions (horizontal axis, 1 to 30) for the constrained 
minimization problem. Labels recoverable from the plot: r = 1.0; 
sequential instruction fetch time = data fetch time; 
data fetch time ~ cache memory speeds.] 

FIG. 6 The Effects of Instruction Starvation 


Abstract -- We present a machine structure which has as its base 
language, encodings of a general class of parallel programs 
known as data flow schemas. The machine is unique in that it 
supports a rich subset of the schema class that includes 
procedures. The procedure implementation scheme is particularly 
novel in that the creation, execution and termination of procedure 
activations are distributed over the machine. The additional 
hardware required to handle procedures is small and smoothly 
incorporated into existing data flow machine designs. 
Furthermore, the execution overhead is low. Fundamental to the 
scheme is selective copying of active parts of an invoked 
procedure. Runtime renaming of the copied parts of the 
procedure is used to maintain the identity of distinct activations. 


1. Introduction 


1.1 Motivation 


One general architectural approach to the task of 
improving machine performance has been the design of parallel 
processors. The traditional approach has been to design stream 
processors [25, 26]. These processors tend to divide into two 
general groups. There are those that achieve modest 
performance improvement but are quite general purpose, typical 
examples being the IBM 360, a SISD machine (Single Instruction, 
Single Data Stream), and the CDC STAR, a MISD machine (Multiple 
Instruction, Single Data Stream) [26]. There are also those that 
achieve striking performance improvement for specific types of 
computations, most notably the SIMD (Single Instruction, Multiple 
Data Stream) machines, such as ILLIAC IV or STARAN [9]. 


These approaches to parallel computation are limited by 
the traditional view that a computation is a sequentially ordered 
set of operations to be performed on a set of data. A more 
natural notion is that any operation can proceed when its 
operand values are available, and the destination of the results of 
the operation are able to receive them. The class of programs 
called data flow schemas captures this idea. 


1.2 A Parallel Representation of Computations 


A program schema in general is an abstract program, 
constructed from a precisely specified set of primitives according 
to a prescribed set of composition rules. Data flow schemas 
(DFS) are graphical models of programs that embody the idea 
that a component operation can be carried out as soon as all its 
operands are available. More generally, control flow can be (and 
in DFSs is) completely determined by flow of data. The most 
important property of DFSs is that the computations modeled by 
DFSs, though highly parallel, are determinate [21]. That is, the 
outputs of any DFS program that terminates on a set of inputs, 
are functions only of the particular set of inputs and not the past 
history, or the particular computation sequence. 


Furthermore, DFSs, in addition to exhibiting large amounts 
of parallelism, are syntactically modular, side-effect free, and 
quite general in expressive power. It has been shown 
(constructively) that DFSs are equivalent to the class of flowchart 
schemas, which are well known to be able to model ALGOL-like 
programs [15]. 


Thus DFSs are a well understood model of parallel 
computation which exhibits a number of properties one might 
desire to have in a language executable on a new (parallel) 
machine architecture. Consequently, if a machine existed which 
could execute instructions that were an encoding of a DFS, and 
executed them in the manner prescribed by the data flow model 
(i.e. flow of control is determined by flow of data), we would have 
a parallel machine which had as its “base language” a data flow 
language [3]. Such a machine would exhibit many of the 
properties of the programs themselves, e.g. a high degree of 
parallelism and determinate computations. 


A number of the architectures proposed to date that 
directly execute encodings of data flow programs cannot handle 
procedures efficiently, or at all [2, 7, 22, 24]. In this paper we 
will present an extension to the data flow machine of Dennis and 
Misunas [7] which can execute data flow programs as procedures 
in a general, efficient and architecturally simple manner. Section 
2 of the paper is a review of the base language of the data flow 
machine. Sections 3.1 to 3.3 review those aspects of the 
machine’s operation that are central to our procedure 
implementation. The remainder of the paper discusses the 
implementation scheme itself. 


2. The Base Language 
2.1 Introduction 


The base language for the new machine, DFM, is a subset 
of a data flow language, DFL. DFL is itself just the set of interpreted 
DF schemas that have been thoroughly discussed in [5, 10, 11, 
13, 15]. We will not introduce the general schema class but 
briefly review the language. 


DFL is a generalization of the idea of execution of an 
operation as soon as its operands are available. A DFL program 
is a bipartite, directed graph whose nodes are 


1. Actors - sites of action 
and 
2. Links - conveyors of values. 


All the roots and leaves of the graph are links. We regard values 
as being associated with tokens which are placed on the input 
and output arcs of the links. An arc may hold at most one token. 


A node’s behavior or firing rule is characterized by a 
simple algorithm which depends on whether it is a link or, if it is 
an actor, which of the four classes of actors it belongs to: primitive 
computational function (PCF), SWITCH, MERGE gate or APPLY. The 
firing rules for links and the first three kinds of actors are 
depicted in figure 1. We will defer the discussion of the APPLY 
actor until later. 

(analogously for “F” tokens) 
Firing Rules 
Figure 1 


In general a node fires by absorbing a token from each of 
its input edges, performing a transformation on them, and placing 
a single token on the appropriate output edge. A node may fire 
only if the output edge to receive the value is empty, and all of 
the input edges of the node are occupied by tokens. An 
exception to this general rule is the MERGE gate. It observes the 
above output restriction but needs only two of its inputs to fire, 

as shown above. 
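
The firing rule can be sketched as follows. This is a minimal illustration, not the machine's implementation: arcs are modelled as values with `None` standing for an empty arc, and the MERGE behavior (the control token selects which data input is consumed) is an assumption based on the usual data flow convention, since the firing-rule figure is not reproduced here. All function names are hypothetical.

```python
EMPTY = None  # an arc holding no token

def can_fire(inputs, output):
    """General rule: every input arc is occupied and the output arc is empty."""
    return output is EMPTY and all(tok is not EMPTY for tok in inputs)

def fire(fn, inputs, output):
    """Return (new_inputs, new_output) after one firing attempt."""
    if not can_fire(inputs, output):
        return inputs, output          # not enabled: nothing changes
    return [EMPTY] * len(inputs), fn(*inputs)

def merge_can_fire(t_in, f_in, control, output):
    """MERGE exception: needs only the control token and the selected input."""
    if output is not EMPTY or control is EMPTY:
        return False
    selected = t_in if control else f_in
    return selected is not EMPTY

print(fire(lambda a, b: a * b, [3, 4], EMPTY))
```

Note that a MERGE with a false control token can fire while its T input is still empty, which is what allows the initial F token on a loop's MERGE gate to admit the first value.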


Given these primitives and the composition rules, we can 
construct fairly elaborate programs. Figure 2 is a data flow 
program that computes factorial(x) (x > 0). 


A Data Flow Program Computing Factorial 
Figure 2 


The actor IDENT performs the identity transformation, the × 
actor’s output is the product of its inputs, and the -1 actor 
decrements its input argument. 


2.2 Data Flow Procedures 


In sequential programming languages, the abstraction 
obtained by using procedures is a useful one. One would expect 
the same sorts of advantages to accrue in a parallel programming 
language. We will incorporate them into DFL by generalizing the 
notion of a program and adding one additional actor type. A DFL 
program is a forest of properly formed DFL programs called 
procedures. We associate a procedure name with each of the 
subprograms in the forest. One of the subprograms, called the 
main program, has the name P0 associated with it, and it is the 
only procedure to receive tokens on its input arcs from an 
“outside agent". That is, it is the entry point of the program in 
the sense that it receives the input values. 


We introduce a new actor which we call an APPLY actor. It 
has m inputs, n outputs and is labelled with a procedure name P_i. 
The APPLY actor, when enabled to fire, (conceptually) substitutes 
for itself a copy of the procedure whose name matches that of 
the actor. This action taken by the actor is defined only if a 
procedure exists with name P_i and the procedure has the same 
number of inputs and outputs as the actor. 


To completely understand how the APPLY actor works, we 
must define the enabling condition, the mechanism for 
transmitting input values to the copied procedure, and the return 
mechanism for results. There are two alternatives we consider: 


1. The APPLY actor is enabled as soon as its first 
argument token arrives. It then copies the procedure (a 
procedure copy is called an instantiation) and passes argument 
tokens as they arrive. An argument is passed by absorbing a 
token from an input arc to the APPLY, and placing a copy of it 
onto the output arc of the corresponding input link of the 
procedure instantiation. The actor copies output values from the 
procedure copy as soon as they become available on the output link’s 
input arc and the corresponding link of the calling program is empty. 
When values from each output link have been returned, the copy 
is destroyed. 


2. The APPLY actor is enabled only when all its argument 
values have arrived and its output links are empty. When these 
two conditions are met, the procedure is copied and the argument 
tokens passed. When all result tokens are available they are 
copied by the APPLY from the input arcs of the procedure’s 
output links to the output arcs of the APPLY actor. The copy of 
the procedure is then destroyed. 


In both cases we assume that the m input links are numbered left 
to right, 0, 1, ..., m-1, for both the APPLY actor and the 
procedure it invokes. We associate the i-th link of the APPLY with 
the i-th link of the procedure it invokes. We treat the n output 
links in a similar fashion. 


The semantics of the two approaches to procedure 
activation are quite different. In the first approach an APPLY 
actor can be thought of as replaced inline by the graph of the 
procedure it invokes. In the second approach an APPLY actor 
behaves exactly like a PCF actor, except that it may have multiple 
outputs and computes a function that is not necessarily in the 
repertoire of PCF’s. The first approach we call the immediate 
copy rule (ICR). The second is called the deferred copy rule 
(DCR). The DCR most closely corresponds with one’s idea that a 
procedure is some sort of a functional abstraction, whereas the 
ICR is more like a macro expansion. The DCR has the advantage 
of simplicity of implementation. It also lends an additional 
homogeneity to the set of actors, since its enabling rule is that of 
a PCF. However, the ICR clearly allows greater parallelism than 
DCR. Furthermore, it is easy to simulate the DCR for an APPLY 
actor using an ICR APPLY actor and gates. Consequently, we have 
chosen to incorporate the ICR form of the APPLY in DFL. 
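
The difference between the two enabling conditions can be made concrete with a small sketch. It models argument and output links as lists in which `None` marks an empty link; the function names are hypothetical and only the enabling tests are shown, not copying or result return.

```python
def icr_enabled(arg_tokens):
    """Immediate copy rule: enabled once any argument token has arrived."""
    return any(tok is not None for tok in arg_tokens)

def dcr_enabled(arg_tokens, output_tokens):
    """Deferred copy rule: all arguments present and all output links empty."""
    return (all(tok is not None for tok in arg_tokens)
            and all(tok is None for tok in output_tokens))

args = [5, None]          # only the first argument has arrived
outs = [None]
print(icr_enabled(args), dcr_enabled(args, outs))
```

With one argument still missing, the ICR form has already started the procedure while the DCR form is still waiting, which is the source of the extra parallelism noted above.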


The ICR has one potential problem. Suppose an argument 
token arrives on the i-th link and the execution of some procedure 
is initiated. Consider what happens if another argument arrives 
on the i-th link before the previously invoked copy of the 
procedure terminates. In order to be consistent we must create 
another copy of the procedure and pass this newly arrived token 
to it. Thus the APPLY actor must "keep track of” an arbitrary 
number of concurrently executing instantiations of the procedure. 
Needless to say, this poses some serious implementation 
questions. If we can demonstrate for every APPLY actor A that 

$$\forall (i, j):\ |T(i) - T(j)| \le 1, \qquad 0 \le i, j < \text{number of inputs of } A$$ 

where T(i) = number of tokens that have arrived on 
the i-th input of A, 

for any reachable configuration of a data flow program, then we 
can show for any APPLY actor A of a DFL program that at most 
one instantiation can exist at any time, and consequently the state 
information is bounded. In general, data flow programs do not 
exhibit this behavior. However, certain large syntactic 
subclasses of DFL programs have been shown to satisfy the above 
arc condition [19, 20]. One such class is known as well formed 
data flow programs. Beside having the above property, a well 
formed data flow procedure, when it terminates, will be in its 
initial configuration. In particular, the only tokens left on the arcs 
of a terminated procedure, will be the initial “F" tokens on the 
gating inputs of MERGE gates of iterative loops. This property 
will be important for the implementation. 
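
The arc condition amounts to requiring that the arrival counts on the inputs of an APPLY actor never differ by more than one. A minimal check, assuming the counts are available as a list (the function name is hypothetical):

```python
def satisfies_arc_condition(arrived):
    """arrived[i] = number of tokens that have arrived on input i."""
    if not arrived:
        return True
    # |T(i) - T(j)| <= 1 for all i, j  is the same as  max - min <= 1.
    return max(arrived) - min(arrived) <= 1

print(satisfies_arc_condition([3, 3, 4]))
print(satisfies_arc_condition([2, 4]))
```

When the check holds in every reachable configuration, a second token can only arrive on an input after every other input has received its first, so the APPLY actor never needs to track more than one instantiation.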


The introduction of procedures concludes our discussion of 
DFL. The reader may note the DFL language does not contain 
structures as a token type, nor does it have any structure 
manipulating actors. These features have been omitted for 
brevity, since they are adequately discussed in [5], and [20]. 


3. The Machine 
3.1 Introduction 


There are obvious hardware analogs to each of the 
actor types of a data flow program. This suggests that it is 
possible to execute an encoding of a data flow program by 
constructing an exact hardware realization of the program [14]. 
This approach has several serious liabilities [19]. Most important, 
the resulting machine would have much idle hardware, since any 
direct implementation of actors would necessitate placing 
substantial computational power at each hardware actor. Also, 
such a machine would not be readily configurable. For all these 
considerations we reject out of hand any direct implementation of 
data flow program graphs. 


The description of the machine architecture we present 
will be at a rather abstract level, and quite brief since our main 
purpose is to introduce the procedure implementation scheme. 
The interested reader is referred to [4, 7, 19] for details. Briefly, 
this new machine, DFM, is a member of a larger general class of 
system structures known as packet communication architectures 
[6]. Such systems consist of a collection of interconnected 
subsystems, which themselves may be packet communication 
structures. The subsystems are interconnected via a set of one 
way links called channels. They communicate with each other by 
sending messages through the channels using a well specified 
protocol. The messages transmitted are known as packets. 


3.2 The Processor Pipeline 


The processor pipeline is composed of four parts, the 
instruction memory (IM), the arbitration network (AN), the 
functional units (FU) and the distribution network (DN) connected 
as shown in figure 3. 


Data Flow Processor With Procedure Capability 
Figure 3 


The IM is the heart of the machine. It contains a collection 
of nc identical units called cells. We can refer to a particular cell 
of the IM by a unique integer which we call a cell name. With 
the exception of APPLY actors, there is a one-to-one 
correspondence between a cell of an object program and an 
actor of the source program. Since we are not interested in the 
precise representation of cells, we will schematically represent 
them as in figure 4. 


Abstract Cell Representation 
Figure 4 


The OPCODE field specifies a function to be computed according 
to the type of the actor the cell represents. It is set at program 
loading time and remains fixed throughout execution. The ARG 
fields are also registers. Their contents are encodings of the 
values that reside on the input arcs to the actor whose 
representation was loaded into the cell. Clearly the ARG fields 
change their contents during execution. Notice that by assuming 
three such fields, only actors with three (or fewer) inputs are 
representable in the machine. The DEST fields contain addresses, 
which are encoded as register pairs which specify an ARG 


register of a cell. If a cell has fewer than three destinations, the 
unused DEST fields are set to a distinguished state notused. 
Similarly, if an OPCODE requires fewer than three arguments, the 
unused ARG fields are set to notused. Multiple destination fields 
are used to provide fanout of values. Since there are three 
destination fields, the maximum branch factor of a link of a 
representable program is three. 


When the function specified by OPCODE (OPCODE ≠ 
SWITCH) is computed, the results are sent to the ARG registers 
of the cells named in the DEST fields. For SWITCHes, one of the 
destinations will be selected on the basis of the Boolean argument 
of the cell. The cell encoding of the factorial program of figure 2 
is shown in figure 5. 


Cell Encoding of Factorial Program 
Figure 5 


By convention, we let the 0th ARG field of a SWITCH correspond 
to the data input. The 1st ARG field corresponds to the control 
input. The 0th DEST field holds the T destination address, and the 
1st the F address. Similarly, the 0th ARG field of a MERGE is the 
T input, the 1st is the F input, and the 2nd is the control input. 


Now that we have outlined the structure of cells, we must 
describe their operation. Figure 6 specifies the operation of cells 
whose opcode field is not MERGE. The operation of cells 
implementing MERGE gates is an obvious modification. 


[Flowchart steps recoverable from the figure: construct output 
packet; place output packet in channel.] 

Cell Operation 
Figure 6 


where an operation packet is a septuple of binary words 

$$\{opc,\ v_0,\ v_1,\ v_2,\ d_0,\ d_1,\ d_2\}$$ 

opc = contents of the OPCODE register 
v_i = contents of register ARG_i 
    = null if register ARG_i is notused 
d_i = contents of register DEST_i 
    = null if register DEST_i is notused 


Thus the cells of the instruction memory pass “messages” to the 
arbitration net requesting that an operation be done. Notice that 
cells have some processing capability - they are small finite state 
machines. However, this capability is quite primitive. The 
operation of the rest of the pipeline should be intuitively clear, 
and is omitted. 
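
The cell layout and the operation packet can be sketched as a small data structure. This is an illustration under assumptions, not the machine's encoding: the `Cell` class, the `EMPTY` marker for an ARG register that is still waiting for an operand, and the enabling test are all hypothetical, and destinations are written as (cell name, ARG index) pairs.

```python
from dataclasses import dataclass, field

NOTUSED = "notused"   # distinguished state for unused fields
EMPTY = None          # a used ARG register with no operand yet

@dataclass
class Cell:
    opcode: str
    args: list = field(default_factory=lambda: [EMPTY, NOTUSED, NOTUSED])
    dests: list = field(default_factory=lambda: [NOTUSED, NOTUSED, NOTUSED])

    def enabled(self):
        # Assumed enabling rule: every used ARG register holds a value.
        return all(a is not EMPTY for a in self.args)

    def operation_packet(self):
        # Septuple (opc, v0, v1, v2, d0, d1, d2); unused fields become None.
        def null(x):
            return None if x == NOTUSED else x
        return (self.opcode, *map(null, self.args), *map(null, self.dests))

cell = Cell("MUL", args=[3, 4, NOTUSED], dests=[("c7", 0), NOTUSED, NOTUSED])
print(cell.enabled(), cell.operation_packet())
```

The fixed three ARG and three DEST slots make the limits stated earlier visible: no actor with more than three inputs, and no fanout greater than three from a single cell.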


3.3 The Virtual Memory 


Now that the basic processor pipeline has been outlined, 
and its major characteristics described, we turn our attention to 
the virtual memory part of the DFM. With iterative and conditional 
programs and recursive procedures (to be described later) it 
becomes obvious that parts of a program are more “active” than 
others. Therefore, it is desirable to have a two level memory 
hierarchy. One level is the instruction memory, in which we keep 
the most active parts of a program. The other level is some sort 
of backing store, much larger than the IM, and lacking any 
processing ability. Hence it is relatively inexpensive. Cell images 
are kept in the backing store. Each cell image is a sequence of 
binary words which completely specifies the state of a cell. 


When a cell is referenced during execution that is not in 
the IM, its image is fetched from the backing store and placed 
into a physical cell (p-cell) of the instruction memory (assuming a 
p-cell is free). If a cell c is referenced that is not in the IM, and 
there are no free p-cells, a p-cell is selected for displacement. A 
p-cell is said to be pristine if it has received no operands, or is a 
MERGE cell and contains only an initial F token. A pristine, 
selected p-cell is simply discarded. Otherwise, the state of the 
selected p-cell, together with the name of the cell that is “in it”, is 
stored in the backing store. This makes room for the cell image of 
c, which is subsequently fetched and installed in the new, free 
p-cell in the IM. Notice that the set of cell names now has the size 
of the backing store plus the IM, rather than just the IM. 
Furthermore, a cell may be resident in the machine as an image 
either in the backing store or in a p-cell. Thus the virtual cell 
name space is much larger than nc. The command and control 
networks and the packet memory system (PMS) implement the 
backing store [6, 19]. To 
retrieve a cell named c, a packet of the form {fetch, c} is issued 
by some controller in the IM to the command network. Storing is 
similar, but a packet tagged store is sent, and the packet also 
contains the state of the named cell. Retrieved cells are returned 
to the IM through the control network, again via a packet 
containing a cell name and the named cell’s state. 
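
The fetch and displacement behavior can be sketched as a toy model. Several things here are assumptions not stated in the text: the class and method names, the choice of the oldest resident p-cell as the one displaced (the selection policy is not specified), and the representation of cell state as an opaque value tested by a caller-supplied `is_pristine` function.

```python
class InstructionMemory:
    """Toy model of the IM backed by a larger store of cell images."""

    def __init__(self, n_cells, backing_store):
        self.n_cells = n_cells          # number of physical cells (p-cells)
        self.resident = {}              # cell name -> state, oldest first
        self.backing = backing_store    # cell name -> cell image

    def reference(self, name, is_pristine):
        """Return the state of cell `name`, bringing it into the IM if needed."""
        if name not in self.resident:
            if len(self.resident) >= self.n_cells:
                # Assumed policy: displace the oldest resident p-cell.
                victim = next(iter(self.resident))
                state = self.resident.pop(victim)
                if not is_pristine(state):
                    self.backing[victim] = state       # like {store, victim, state}
            self.resident[name] = self.backing[name]   # like {fetch, name}
        return self.resident[name]

backing = {"a": "A0", "b": "B0", "c": "C0"}
im = InstructionMemory(2, backing)
pristine = lambda state: state.endswith("0")
im.reference("a", pristine)
im.resident["a"] = "A1"          # cell a receives an operand
im.reference("b", pristine)
im.reference("c", pristine)      # displaces a; its state is written back
print(sorted(im.resident), backing["a"])
```

A pristine cell needs no write-back because its image in the backing store is still an exact description of its state.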


3.4 Procedure Invocation and Activation Names 


There are several ways of invoking a procedure in a data 
flow language that are consistent with the data flow model. 
Consider, for example, a single input, k output procedure P. The 
effect of a token arriving at an actor A labelled APPLY P is easily 
described. When a data token arrives on the input arc, a copy 
of the data flow graph for P is made, the token is absorbed from the 
input arc of A and a copy of it is placed on the output arc of the 
input link of procedure P. As each of the k outputs for this 
activation of P is produced, it is passed from its output link to the 
corresponding output link of the APPLY and hence to a successor 
cell of A. (For convenience in discussion, each cell in a data flow 
program is assumed to have a unique name associated with it. 
The name of a cell will be shown in the figures as <name>: next 
to the representation of the cell. The name of a cell is used in 
the data flow processor to identify it, for example to route 
result packets to it or to retrieve it from memory.) To be 
syntactically correct, P must have one input link and k output 
links. 


The heart of the procedure implementation scheme is the 
relocation box together with a special functional unit that can do 
“byte” manipulations on operation packets. Anticipating 
mechanisms to be proposed later, some of the functions of the 
relocation box (RB) will be described. It is assumed that every 
actor in a data flow program as represented in the processor has 
a unique cell name except for APPLY actors. Further, during the
course of execution (where and how will be described later) a
cell name may have a suffix appended to it; these suffixes will
serve to distinguish activations. At any time during the execution
of a program, there will be a one-to-one correspondence
between used suffixes and procedure activations. In the following
discussion we will separate a cell name from a suffix by a “.”.


The RB’s operation is quite simple. Upon receipt of a fetch
packet {fetch, α.σ} from the memory command network, it
passes the packet {fetch, α} to the packet memory system.
When the cell image of α is returned by the PMS to the relocation
box, all the names in its destination fields are changed to have
suffix σ. The RB then passes the resulting cell image back
through the memory control network to the instruction memory.
Finally, it is assumed that, with the sole exception of the
relocation box and one special functional unit, no other
component of the data flow processor of figure 3 can distinguish
whether a cell name has a suffix appended or not. That is, if the
distribution network, for example, receives a packet with a
destination cell name α.σ, it sends a result packet to cell ασ (the
dot separating the name from the suffix is included merely as an
aid to the reader’s eye). The essential idea is that a complete cell
name (i.e. a cell name plus an appended suffix, which we refer to
as an execution name) is treated everywhere but the relocation
box and the distinguished functional unit as a single entity: a
cell designation.
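The renaming step performed by the relocation box can be sketched in a few lines of Python. This is only an illustration under assumed representations, not the hardware mechanism itself: the `Cell` record, the dictionary standing in for the packet memory system, and the function name `relocate` are all hypothetical, and the sketch assumes the returned image is also labelled with the full execution name.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Cell:
    """Hypothetical cell image: a name, an opcode and destination names."""
    name: str
    opcode: str
    destinations: tuple


def relocate(pms, execution_name):
    """Serve {fetch, name.suffix}: fetch `name`, then suffix its destinations."""
    base, _, suffix = execution_name.partition(".")
    image = pms[base]  # the PMS is asked only for the unsuffixed name
    renamed = tuple(f"{dest}.{suffix}" for dest in image.destinations)
    # Assumption: the image handed to the IM carries the execution name.
    return replace(image, name=execution_name, destinations=renamed)


# Toy packet memory system holding one cell of a procedure P.
pms = {"P": Cell("P", "ADD", ("D0", "D1"))}
cell = relocate(pms, "P.s7")
```

The stored cell is left untouched, so any number of activations can be derived from the same image.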


Before introducing the complete procedure mechanism, we 
will first demonstrate how a single input/single output procedure 
is invoked on the machine. The APPLY cell in particular has the 
format shown in figure 7. 
[Figure: an APPLY cell with fields P, an empty ARG register, and DEST.Q]


An Apply Cell 
Figure 7 


P is the name of the procedure to be invoked, DEST.Q is the name
of the cell that is to receive the result value, and the empty ARG
register is to receive the argument for the procedure call. The
implementation of a single-input, single-output APPLY actor is
straightforward. When the operand a arrives, the APPLY becomes
enabled and transmits the packet
{APPLY, P, a, DEST.Q, NULL.Q, NULL.Q}, which the arbitration
network routes to the special functional unit that processes
procedures. On receipt of an apply operation packet, the FU
creates a unique suffix σ and then outputs two packets, {a, P.σ}
and {DEST.Q, RT.σ}. (Every packet sent to a cell must also
contain field addresses, that is, specifications of which register in
a cell is to receive the value(s) conveyed by the packet. These
will not be explicitly represented in the diagrams since it should
be clear from the context which register of a cell is supposed to
receive which value. Leaving out the field address will make the
diagrams a bit less detailed and hence less confusing.)


The first packet to arrive at the instruction memory
destined for the procedure activation P.σ causes the cache
mechanism of the instruction memory to retrieve from the packet
memory system a cell with name P.σ. (Since σ is a unique suffix, cell
P.σ can’t possibly have been in the instruction memory.) Due to
the action of the relocation box, a copy of cell P will be retrieved
and all its destination fields given the suffix σ. Once the cell P.σ is
successfully installed in the instruction memory, actor P.σ will
then receive its operand a and become enabled. Whatever
computation is specified by P (the first actor of the invoked
procedure) will be carried out and the resulting values will be in
packets destined for cells D_0.σ, D_1.σ, ..., D_n.σ, where D_0, ..., D_n are
the contents of the destination fields of P. Clearly D_0.σ, ..., D_n.σ
will not be found in instruction memory and will be fetched from
the packet memory system in the manner described above for
P.σ. And so execution of the σ-th activation of P proceeds.


To return the value computed by P we assume that all
procedures are compiled so that their output value is to be sent
to a cell named RT, which is a reserved cell name. That is, rather
than having an output link, programs are compiled to have a
(uniformly named) output cell. Further, it is assumed that
resident in the packet memory system is a cell with name RT, as
shown in figure 8.

[Figure: the cell named RT, with its fields marked “not used”]


Return Cell 
Figure 8 


This cell belongs to no procedure (i.e. it is a runtime support
cell). It will be retrieved by the second packet {DEST.Q, RT.σ}
generated by the FU as a result of the APPLY cell firing. When it
is finally resident in the IM, it will appear as in figure 9.


[Figure: the cell RT.σ, with fields DEST.Q and NULL.σ]


Return Cell in IM 
Figure 9 


With the compiler convention previously mentioned, when
execution of the procedure is complete the packet {z, RT.σ}
will be sent by the FU to the DN, where z is the output value of
the σ-th activation of P. Thus RT.σ will be enabled and create the
packet {RET, DEST.Q, z, NULL, NULL.σ, NULL.σ, NULL.σ}, which is
sent to the appropriate functional unit. This FU will then output
the packet {z, DEST.Q}, thus sending the result of the σ-th
activation of P to the correct destination cell, and return σ to the
pool of free suffixes.
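The call and return traffic for the single input/single output case can be traced with a small Python model. It is a sketch under stated assumptions rather than a description of the actual functional unit: the class `ApplyUnit`, its method names, the tuple form `(value, destination)` for packets, and suffixes of the form `s0`, `s1`, ... are all hypothetical.

```python
import itertools


class ApplyUnit:
    """Toy functional unit for single-input, single-output APPLY and RET."""

    def __init__(self):
        self._fresh = itertools.count()
        self.free_suffixes = []  # suffixes returned by finished activations

    def _new_suffix(self):
        # Reuse a freed suffix when one exists, otherwise mint a new one.
        if self.free_suffixes:
            return self.free_suffixes.pop()
        return f"s{next(self._fresh)}"

    def apply(self, proc, arg, dest):
        """Process {APPLY, proc, arg, dest, ...}; return (suffix, packets)."""
        sigma = self._new_suffix()
        # One packet carries the argument to the entry cell; the other
        # tells the return cell of this activation where the result goes.
        return sigma, [(arg, f"{proc}.{sigma}"), (dest, f"RT.{sigma}")]

    def ret(self, dest, value, sigma):
        """Process {RET, dest, value, ...}: deliver the result, free sigma."""
        self.free_suffixes.append(sigma)
        return [(value, dest)]


fu = ApplyUnit()
sigma, call_packets = fu.apply("P", 5, "DEST.Q")
result_packets = fu.ret("DEST.Q", 25, sigma)
```

Here `call_packets` corresponds to the two packets emitted on APPLY and `result_packets` to the single packet emitted on RET; the freed suffix becomes available to the next invocation.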


The question immediately arises as to whether the machine
resources are all reclaimed. Clearly the activation name σ has
been recovered. But what about the “residue” from the now
terminated procedure activation? We remind the reader that the
procedure is a well formed data flow program. Consequently, at
termination, all of its cells are pristine. Thus, because of the cache
displacement algorithm (described in section 3.3), there will be no
cells with a σ suffix in the PMS or in the request queues of the
PMS. So all the σ residue is in the IM. Since cell names
(unadorned by suffixes) are all unique, this residue will be
purged from the IM as room is required for new cells. No trouble
arises should σ be reused as an activation name for the same
procedure as its last use before that residue has been cleared,
because the left-over cells are all pristine.


It may be objected that the IM should be immediately
purged of σ residue to prevent displacement of “useful” cells
from the memory. This can be done by having the FU emit a
packet {σ, IM} in addition to the result packet {z, DEST.Q} when
it processes the RET operation packet. Upon receipt of this packet,
the IM cache manager purges itself of all cells with a σ suffix. When
it is finished, it sends the packet {RECLAIM, σ} to the FU, which
then places σ on the free list of suffixes. Alternatively, we could
construct the cache manager so that pristine cells have highest
priority for removal.


This procedure application scheme has several attractions. 
First it is simple. Overhead in terms of storage, or extra packets 
in the system, is almost zero. Few changes need to be made to 
the basic data flow processor, and those that are necessary are 
incorporated in a smooth and natural way. Also, notice that in this 
scheme the entire procedure is not copied, just the pieces of it 
as they become active. This is an important characteristic 
especially for programs with conditional constructs. For these 
programs, the amount of processing activity is not uniform over 
all program actors. In particular the predicate of a conditional
program Π will select either the “true” or “false” subgraph of Π.
It will never be the case that both subgraphs (of a given
activation) execute. Thus to load both of them into the instruction 
memory is wasteful of instruction memory space and memory 
control network bandwidth. Finally, procedures are compiled no 
differently than programs, thus allowing (without recompilation) 
the use of any data flow program as a data flow procedure. This 
will be discussed at greater length in section 3.7. 


3.5 More Elaborate Schemes


The primary deficiency of this scheme for procedure 
implementation is that it supports a rather primitive form of the 
APPLY actor - only one input and one output. If this sort of 
procedure call were incorporated in a data flow processor which 
could manipulate data structures, this deficiency would not be so
bad. Multiple input values and multiple output values could
then be encoded as structures. However, such a form of
procedure invocation would be undesirable because it would limit
the degree of parallelism achievable. After all, there is no
inherent reason for returning all the outputs of a procedure
simultaneously. If a procedure produces k outputs, to wait for all
of them to be computed, assemble them into a structure, return
the structure to the calling routine, disassemble the structure,
and then use the resulting k components restricts the amount
of concurrency achievable in a program. It also incurs the
overhead of assembly and disassembly of structures. One would
like to pass each of the k output values to its destination in the
calling procedure as it is generated.


Similarly, one would like to have multi-argument functions.
Passing an ℓ-component structure with the ℓ argument values as
its components to a procedure is undesirable. Again there is a
loss of parallelism. A particular subgraph of the data flow
procedure may require only a subset of the ℓ input values to
start execution. Thus to inhibit passing of any argument values
to the procedure until all of them are available seriously limits
the achievable level of concurrency.


The naive approach to implementing multiple input/multiple
output APPLY (in the context of the previous discussion) is simply
to have ℓ apply cells for an ℓ-input APPLY actor, each receiving
one argument value:


[Figure: an APPLY cell for one argument of a multiple input APPLY; one field is marked “not used”]


APPLY Cell for Multiple Input APPLY
Figure 10

All fields are as in figure 7 except P_j is the j-th input link of
procedure P. The compiler is assumed to write “code” to send the
j-th argument value to cell AP_j. We also place some number n of
RETurn cells in the PMS, with names RT_0, ..., RT_(n-1). When the FU
processes an APPLY operation packet it issues packets that
retrieve k return nodes RT_0.σ, ..., RT_(k-1).σ and supplies each with
the appropriate destination name. Unfortunately, this scheme
does not work. The functional unit will assign a new activation
name (new suffix) to each cell AP_j. This will guarantee incorrect
operation for all but the most contrived programs. Even if this
could somehow be patched, there is also no way to return output
values correctly since each cell AP_j would cause a new set of
RETurn cells to be fetched. Finally, there is no mechanism for
freeing suffixes once a procedure instance terminates.


To correctly invoke a procedure we need to guarantee that 
for each set of input values to a particular APPLY actor only the 
first argument value causes generation of a packet which causes 
the functional unit to create a new activation name. Furthermore, 
we must make certain that the other argument values that are 
sent to the APPLY actor are all passed to the same procedure 
activation as the first. Last, we must be sure only one return 
mechanism is set up. To do this we introduce a new actor which
we call SEQ (for sequence). It has ℓ inputs and ℓ outputs.


Operation of the SEQ actor is simple. Upon receipt of any
one of its inputs, say on input arc j, it produces an output value
which is a unique suffix name. This value is then passed out on
each of SEQ’s output arcs. No further action is taken until tokens
arrive on inputs 0, 1, ..., j-1, j+1, ..., ℓ-1. When this state occurs,
one token on each of the input arcs i (i ≠ j) is absorbed and no
output token is produced. The actor (and cell) then return to the
initial state and the above action is repeated. Figure 11 shows
two examples of possible firing sequences of a SEQ actor.


SEQ Actor Firing Rule 
Figure 11 


Notice that while the times at which outputs are produced are
quite unusual compared to the other data flow actors, only one
output is produced for each set of ℓ input values.
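The firing rule just described can be modelled as a small state machine. The following Python sketch is illustrative only: the class `SeqActor`, its `receive` method, and the `new_suffix` callback are hypothetical names, and the number of output arcs is passed as a separate parameter so that no particular count is assumed.

```python
import itertools


class SeqActor:
    """Toy SEQ actor: the first token of a round yields a fresh suffix."""

    def __init__(self, n_inputs, n_outputs, new_suffix):
        self.n_inputs = n_inputs
        self.n_outputs = n_outputs
        self.new_suffix = new_suffix  # callable returning a fresh suffix
        self.arrived = set()          # input arcs holding a token this round

    def receive(self, arc):
        """Accept a token on input `arc`; return the tokens sent out."""
        if not 0 <= arc < self.n_inputs or arc in self.arrived:
            raise ValueError(f"unexpected token on input arc {arc}")
        first = not self.arrived
        self.arrived.add(arc)
        outputs = []
        if first:
            # The first arrival fires: one suffix copied to every output arc.
            suffix = self.new_suffix()
            outputs = [suffix] * self.n_outputs
        if len(self.arrived) == self.n_inputs:
            # All inputs seen: absorb them silently and reset.
            self.arrived.clear()
        return outputs


counter = itertools.count()
seq = SeqActor(3, 3, lambda: f"s{next(counter)}")
trace = [seq.receive(arc) for arc in (1, 0, 2, 2)]
```

In `trace`, only the first and fourth calls produce output, matching the rule that one suffix is produced per complete set of inputs.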


To allow freeing of activation names we introduce a new 
actor type FREE. Like SEQ, the FREE actor is not available to the 
programmer. It is a “runtime” support actor that allows proper 
activation name maintenance. This new actor has been introduced 
(just like SEQ actors) as a convenient way of showing how a 
procedure call and return is implemented. It is not an addition to 
DFL, but merely a way of presenting the details of the call-return
mechanism that preserves a one-to-one correspondence between
cells and actors. The FREE cell receives as inputs copies of each
of the output values of the procedure instance of which it is a
member. When all the output values have been produced the
FREE cell is enabled. It then outputs a packet {FREE, z_0, z_1, ...,
NULL, NULL.σ, ..., NULL.σ} (where z_0, z_1, ... are the output
values) which is routed to the functional
unit that handles APPLYs and RETurns. Upon receipt of this
packet the functional unit frees the activation name σ or outputs
the packet {σ, IM} as before.


The reader should notice that unlike the RETurn cells, the
FREE cell is explicitly included as part of the procedure
application mechanism. This saves us the difficulty of determining
at runtime the number of values the FREE cell must receive 
before becoming enabled. It may be objected that since the 
RETurn actors are not part of the procedure that the RETurn 
cells have no way to "know" what the name of the FREE cell is. 
Consequently, they cannot send result values to it. To fix this we 
send two destination names to the RETurn actors at runtime - the 
name of the cell that is to receive a result of the procedure call, 
and the name of the FREE cell. To ensure that these names are 
sent exactly once to the return cells, we add one more runtime 
cell, with an opcode APRET, whose function is described below. 
The full procedure call mechanism is shown in figure 12. 


Procedure Call Mechanism 
Figure 12 


The ℓ-input APPLY actor now maps into ℓ APPLY cells of the
form in figure 10 except that they have a NULL destination field,
and the third argument field is used to hold a suffix name. SEQ
and FREE actors are added to maintain activation names, and RET
and APRET actors to handle return values. However, at the
source level all the user is aware of is that he is using a single
actor, an ℓ-input/k-output immediate copy APPLY. To see how
this all works, let us suppose that the Q-th activation of some
procedure produces the i-th argument a_i for an application of P,
where P has ℓ inputs and k outputs. For concreteness, we depict
the case of ℓ = 2, k = 3 in figures 13 through 15.


The compiler will have generated code so that the actor
that produces the value a_i has as its destinations the cells AP_i
and SQ. Thus the packets {a_i, AP_i.Q} and {a_i, SQ.Q} will be
produced. These will cause the two cells shown in figure 13 to be
fetched into the instruction memory.


[Figure: the cells AP_i.Q and SQ.Q]


Figure 13

The cells are depicted before the argument packets {a_i, AP_i.Q}
and {a_i, SQ.Q} have been delivered. SQ is now enabled and will
generate the packet {SEQ, ..., a_i, ..., AP_0.Q, AP_1.Q, APR.Q},
which is routed to the special functional unit. This FU outputs the
packets {σ, AP_0.Q}, {σ, AP_1.Q}, and {σ, APR.Q}. The packet
{σ, AP_i.Q} will enable the cell AP_i.Q, and as a result it will
output the packet {APPLY, P_i, a_i, σ, NULL.Q, NULL.Q, NULL.Q},
which is then routed to the apply functional unit. It then outputs
the packet {a_i, P_i.σ}, and execution of the σ-th activation
proceeds from the i-th entry cell as described in the single input
case. The other packets output by the functional unit, {σ, AP_j.Q}
with j ≠ i and j < ℓ, each cause cache faults, thus bringing in the
cells shown in figure 14.


[Figure: a cell AP_j.Q]


Figure 14


When their arguments arrive, execution will proceed in the
manner described for AP_i.


The returning of values from the σ-th activation of Π is
handled in a fashion similar to that described in the previous
section on single input/single output APPLY. In this case it is the
APRET cell that causes the RET cells to be fetched into the
instruction memory. The packet {σ, APR.Q} output by the FU
retrieves the cell shown in figure 15.


[Figure: the APRET cell APR.Q; legible fields are “not used” and a DEST.Q destination]


Figure 15 


Notice that the APRET cell has, as a constant operand, the name
of the FREE cell. When it is enabled it produces the output packet
{APRET, σ, FR, NULL, DEST_0.Q, DEST_1.Q, DEST_2.Q}. This packet in
turn causes the FU to output the packets
{DEST_0.Q, RT_0.σ}, {FR.Q, RT_0.σ}, {DEST_1.Q, RT_1.σ}, {FR.Q, RT_1.σ}, ...
Thus the RETurn cells receive as parameters the names of both
of their destinations. When a RET operation packet is processed
by the FU, both the named FREE cell and the return destination
cell receive copies of the output. We see that output values are
returned as before. However, processing a RET operation packet
does not cause the FU to terminate the activation. Instead it is
the firing of the FREE cell that instructs the FU to terminate an
activation as discussed above.


3.6 Relocation Box Revisited 


To simplify the previous discussion we have purposely 
oversimplified part of the operation of the machine. We had said 
that the relocation box upon receipt of a fetch packet would 
always pass the packet to the PMS with the suffix stripped off 
the cell name. This is incorrect. If the machine really operated in 
this fashion, then it could never retrieve cells that had been 
displaced from the IM into the PMS. There are a number of 
solutions to this problem. However, selection among them is 
impossible without a more detailed model of the implementation 
of the packet memory system and the cache mechanism. Thus a 
full discussion is beyond the scope of this paper. The particular 
solution described here was chosen for its simplicity, and should 
not be thought of as an “optimal” solution. 


Upon receipt of a packet {fetch, α.σ} the relocation box
first checks if σ is a valid activation name. If it is not, the RB
signals a runtime error; otherwise it issues two fetch requests to
the PMS. One is for a cell with name α, and the other is for a cell
with name α.σ. The PMS will respond in one of several ways.


1. If the PMS returns no entry for either request, then the
RB signals an error.


2. If the PMS returns a cell image for α.σ, and no entry for
α, then the RB signals an error.


3. If the PMS returns a cell image for both, then the cell
image returned in response to the request α.σ is passed back to
the IM unaltered.


4. If the PMS returns a cell image for the request α, and no
entry for the request α.σ, then the cell image returned in
response to the request α is passed back to the IM with its
destination fields altered as previously discussed.
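The four cases listed above amount to a short decision procedure, sketched here in Python. The sketch is an assumption-laden illustration, not the paper's implementation: the function name `rb_fetch`, the dictionary standing in for the PMS, the dictionary form of a cell image, and the choice of exception types are all hypothetical.

```python
def rb_fetch(pms, valid_suffixes, execution_name):
    """Resolve {fetch, name.suffix} following the four cases listed above."""
    base, _, suffix = execution_name.partition(".")
    if suffix not in valid_suffixes:
        raise RuntimeError(f"invalid activation name: {suffix!r}")
    template = pms.get(base)             # request for the unsuffixed name
    displaced = pms.get(execution_name)  # request for the full execution name
    if template is None:
        # Cases 1 and 2: there is no unsuffixed cell of that name.
        raise LookupError(f"no cell named {base!r}")
    if displaced is not None:
        # Case 3: a previously displaced cell is returned unaltered.
        return displaced
    # Case 4: first reference in this activation, so relabel a fresh copy.
    return {
        "name": execution_name,
        "opcode": template["opcode"],
        "destinations": [f"{d}.{suffix}" for d in template["destinations"]],
    }


pms = {
    "A": {"name": "A", "opcode": "ADD", "destinations": ["B"]},
    "A.s1": {"name": "A.s1", "opcode": "ADD", "destinations": ["B.s1"],
             "operand": 4},
}
fresh = rb_fetch(pms, {"s1", "s2"}, "A.s2")      # case 4
restored = rb_fetch(pms, {"s1", "s2"}, "A.s1")   # case 3
```

Case 3 matters because a displaced cell may already hold operands (here the hypothetical `operand` field), which a fresh copy of the template would lose.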


3.7 Names and Loading 


Another attraction of this implementation scheme for 
APPLY is that it does away with the need for an elaborate linking 
loader for data flow programs. Consider for example, a data flow 
program consisting of several procedures that have been 
independently compiled. One may assume for the purposes of 
discussion that the cell names (and hence the contents of the 
destination fields of a cell) correspond in some direct way to the 
memory locations of the packet memory system (PMS). Thus when 
loading the component parts of the program (i.e. the procedures) 
into the PMS two things must be done. Assume for concreteness
that a procedure is compiled into a linear block of cells numbered
from 0 and that cell numbers are cell names. When loading a
procedure, a number equal to the cell number into which the first
cell of the procedure was loaded must be added to all destination
fields of all cells that refer to cells that are part of the
procedure. Then all external references in a procedure, that is,
entrance cell names in APPLY cells, must be set to the correct
value. This value depends of course on the location into which
the referenced procedure is loaded. Notice however, that return
names need not be relocated since they are “constructed” at
runtime. Thus no correcting of the destination addresses of an
APRET cell need be done. Indeed, the RETurn cells which actually
transfer return values to the invoking procedure are not even
part of the invoked procedure. Thus, of the two tasks of loading,
fixing internal references (by adding an offset) is easy, and
fixing external references is greatly simplified. Only half of the
job must be done before execution, since the entry but not the
return points must be relocated.
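The two loading tasks can be illustrated with a minimal Python sketch. It assumes, as the text does for concreteness, that cell names are cell numbers; everything else is hypothetical, including the function name `load_procedure`, the dictionary form of a cell, the `proc` key marking an external reference in an APPLY cell, and the `entry_table` mapping procedure names to load addresses.

```python
def load_procedure(pms, block, base, entry_table):
    """Load `block` (cells numbered from 0) into `pms` starting at `base`."""
    for offset, cell in enumerate(block):
        loaded = {
            "op": cell["op"],
            # Internal references: add the load address to each destination.
            "dests": [d + base for d in cell["dests"]],
        }
        if "proc" in cell:
            # External reference: replace the procedure name by the
            # address at which its entrance cell was loaded.
            loaded["proc"] = entry_table[cell["proc"]]
        pms[base + offset] = loaded
    return pms


block = [
    {"op": "ADD", "dests": [1]},
    {"op": "APPLY", "dests": [2], "proc": "F"},
    {"op": "OUT", "dests": []},
]
pms = load_procedure({}, block, 100, {"F": 40})
```

No step touches return names, which is the point made above: they are constructed at runtime from the activation suffix.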


The naming scheme provides a solution to establishing the
“identity” of distinct activations amenable to efficient hardware
implementation. Since a cell of an activation is uniquely identified
by its name (assigned in the original compilation) and a single
suffix, names are bounded in size (we assume of course that
suffixes and simple names are of fixed maximum length).


3.8 One Last Change 


Creation and returning to the “free pool” of activation
names (i.e. suffixes) is probably not an appropriate activity for a
FU. The primary reason is that one would like to have multiple
FUs for processing of APPLY, RET, FREE and SEQ packets.
Coordinating the creation and returning of suffixes to the free
pool among several autonomous, asynchronously operating
modules (FUs) is a messy task. It also introduces an overhead,
since coordination among the FUs must take place through exchange
of messages (packets) if we are to keep the overall machine
structure consistent. Consequently, we propose the following
modification to the scheme described above. We introduce an
additional data path from each FU that processes applies etc. to
the relocation box and from the relocation box to those FUs.
Thus the machine structure is:


[Figure: block diagram connecting the functional units, the instruction memory cells, the distribution network, the arbitration network, the command and control networks, the relocation box, and the packet memory system]


Final Machine Structure Supporting Procedures 
Figure 16 


When the j-th functional unit needs a new suffix, rather than
generating it internally, it now sends the packet {NEWSUFFIX, j}
to the relocation box. The RB responds by generating a new
suffix name and sending the new name (as a packet over the new
data path) to the requesting FU. When the FU wishes to free a
suffix σ it sends the packet {FREESUFFIX, σ} to the RB, which
then returns σ to the pool of free suffix names. Thus in the new
machine, assignment and freeing of activation names takes place
at a central location, hence avoiding the problems of maintaining
a distributed list of free suffix names. The choice of using the
relocation box to perform these functions is somewhat arbitrary,
the main idea being to centralize activation name management.
Where it is done is not critical. Note that simply letting each
apply FU manage its own activation names cannot work even if a
FU’s pool of names contains no elements in common with any of
the others. The problem is that an additional mechanism must be
provided to guarantee that a packet {FREE, σ} is routed to the
functional unit that created σ. Because of this complication the
above scheme is inferior to the central name allocator.
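A central allocator of this kind is easy to model. The Python sketch below is a hedged illustration only: the class `SuffixAllocator`, the tuple form of packets, the integer suffixes, and the fixed pool size are hypothetical choices, not details given in the text.

```python
class SuffixAllocator:
    """Toy model of activation name management inside the relocation box."""

    def __init__(self, n_suffixes):
        self.free = list(range(n_suffixes))  # pool of free suffix names
        self.in_use = set()

    def handle(self, packet):
        """Process one packet; return a reply packet or None."""
        op, arg = packet
        if op == "NEWSUFFIX":
            if not self.free:
                raise RuntimeError("no free activation names")
            suffix = self.free.pop(0)
            self.in_use.add(suffix)
            return (arg, suffix)  # reply addressed to requesting FU `arg`
        if op == "FREESUFFIX":
            self.in_use.remove(arg)  # raises KeyError if arg was not issued
            self.free.append(arg)
            return None
        raise ValueError(f"unknown packet type {op!r}")


rb = SuffixAllocator(2)
first = rb.handle(("NEWSUFFIX", 0))   # FU 0 asks for a suffix
second = rb.handle(("NEWSUFFIX", 1))  # FU 1 asks for a suffix
rb.handle(("FREESUFFIX", 0))          # any FU may free any suffix
```

Because the pool lives in one place, a FREESUFFIX packet never needs to be routed back to the functional unit that originally requested the suffix.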


3.9 Procedure Variables 


Finally, it should be observed that the naming scheme for
establishing unique activations is in no way dependent on the
fact that the name of the procedure to be invoked by an apply
cell is constant. (Its placement in an operand field of the
APPLY cell was intentional.) Consequently, without adding any
additional mechanisms to the schemes proposed, procedure
variables can be handled. One merely has to compile the APPLY
cells with empty entries where the entrance cell names for the
invoked procedure were. Of course some other actor of the
program now must send a cell name corresponding to an entrance
point of a procedure to the appropriate APPLY cell in order for it
to become enabled.


One simple way to do this is to have each APPLY cell of
the call mechanism receive the name of the procedure it is to
invoke, rather than a node name of a particular entry point. When
the i-th one is enabled it passes to the FU a packet of the form
{APPLY_i, P, σ, a, NULL.Q, NULL.Q, NULL.Q}. The FU “knows” from
the opcode and the first two arguments that processing of this
packet is supposed to send the i-th argument to the σ-th activation
of procedure P. The FU can either


1. Look up in a table set up at loading time the name of the
cell that is the i-th entry point of procedure P,
or
2. With a suitable compiler convention, compute the name
of the cell given i and P.


Though either approach works, the former has two advantages.
First, since the call structure is set up at compile time, the
number of inputs and outputs to the apply mechanism is known.
This information could be incorporated in the encoding of the
APPLY cells and also in the table. This would allow the FU to do a
runtime check to see if the named procedure has the appropriate
number of inputs and outputs. Second, using a table allows the FU
to check if procedure P was defined.
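The table-based alternative, together with the two runtime checks it permits, might look as follows. This is a sketch under assumptions: the table layout, the function name `resolve_entry`, the cell names, and the exception types are hypothetical and are not specified in the text.

```python
def resolve_entry(table, proc, i, n_inputs, n_outputs):
    """Return the name of the i-th entry cell of `proc`, with checks."""
    info = table.get(proc)
    if info is None:
        # Second advantage of the table: undefined procedures are detected.
        raise LookupError(f"procedure {proc!r} is not defined")
    if len(info["entries"]) != n_inputs or info["outputs"] != n_outputs:
        # First advantage: the call site's arity can be checked at runtime.
        raise TypeError(f"procedure {proc!r} has the wrong number of "
                        "inputs or outputs for this call")
    return info["entries"][i]


# Hypothetical table built at loading time: entry cell names per procedure.
table = {"P": {"entries": ["P_in0", "P_in1"], "outputs": 3}}
entry = resolve_entry(table, "P", 1, n_inputs=2, n_outputs=3)
```

Here the call-site arity (`n_inputs`, `n_outputs`) is assumed to be encoded in the APPLY cells, as the text suggests it could be.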


3.10 Conclusion 


We have presented here several schemes of increasing 
capability for implementing procedure application in a large class 
of data flow processors. All of the schemes implemented the 
immediate copy rule - that is, a data flow program with an apply 
actor is semantically equivalent to one where the APPLY has 
been replaced by the program graph of the procedure it invokes. 
This effect can only be achieved at runtime since recursive 
procedures and procedure variables are allowed. The schemes 
presented were efficient in the sense that the overhead in terms 
of the number of packets required to set up and terminate an 
activation was small. In addition only the pieces of the invoked 
procedure that were active were brought into the instruction 
memory. This helps economize the use of instruction memory 
space, especially in the case of data flow procedures with 
conditional components. 


This scheme was designed to be quite general, and minor 
modifications to it can achieve very different behaviour. For 
example: 


1.
 a) Forcing SEQ to wait for all its inputs before firing
 b) Altering the compiler so that the FREE cell of a
 return mechanism receives its values from the same sources
 as the RETurn cells


changes the invocation rule to DCR.


2. We restricted programs to have only one outstanding
procedure activation at each call site (section 2.2). If we make
some simple modifications, the scheme will work in a DFM with
acknowledgement of packets [8] and this restriction can be
eliminated. Another minor modification then allows the machine to
handle streams as input and output data types [27]. The details
involve no major changes, but are beyond the scope of this
paper.


The mechanism for procedure implementation outlined here 
should be regarded as a scheme, and not a particular method. The 
things that are central to the approach are the ideas of


1. Creation of a virtual cell name space, using a
hierarchical associative store.


2. Creation and separation of different procedure 
instances through the runtime renaming of the cells that 
store the (encoding of the) procedure instance. 


3. Selective copying of the parts of procedures as 
they become active. 


4. Keeping the amount of state information that is 
necessary to characterize a procedure instance bounded 
and encoded in the cells by imposing syntactic restrictions 
on the base language. 


Abstract : This paper presents the hardware
specifications and figures of a parallel
multi-processor system, currently under construction.
The LAU system philosophy comes from the Single
Assignment software concept and data-directed
expression of problems. Up to now, a high level
language, a machine language and a paper machine
have been defined. A compiler and a simulator gave
results significant enough to start the actual
implementation of a prototype processor. The paper
focuses on the advantages of data-directed control
mechanisms and register-independent instruction/
data formats throughout the different parts of the
processor : maximally efficient pipelining, full
parallelism and asynchronism will be shown at the
different stages of an instruction execution, with
accompanying figures in speed, complexity and
cost of their implementation.


Introduction 


This paper presents the hardware specifications 
of a parallel multi-processor architecture, called 
"LAU system", whose control and sequencing 
mechanisms are directed by the computation of the 
program data. The data-driven control leads to 
some interesting properties and advantages in the 
design of the system, in terms of "independent" 
pipelining, parallelism and asynchronism. 


Since 1973, when the LAU Project started
[1, 2], the "Single Assignment" software concept
[3, 4] has led to the definition of a high level
language, a machine language and a paper machine.
A compiler and a simulator gave results significant
enough to allow the actual implementation
of the system. As in other approaches, notably those
of J.B. Dennis [5, 6, 7], all control
mechanisms are provided by the "readiness" of
data. In short, a statement in the (high-level
or machine) language is executable as soon as
its operands have their 'unique' values computed.
Instruction sequencing is readily achieved by the
data flow itself, and no longer by the relative
order of statements. The single assignment rule,
combined with these properties, guarantees program
determinism, and leads to a "maximal parallelism"
together with some interesting hardware design
advantages. To become a little more acquainted with
the LAU system philosophy, the reader is invited
to follow an instruction stream, from its
generation by the compiler to its execution in the
different parts of the machine. In the first section
of this paper, we present the high level software
and hardware characteristics of the LAU
system, some of them being still open problems.
The second section outlines the architecture of
one processor and the specific mechanisms implementing
the data directed flow of control. The third
section gives more details on each of the functional
parts of a processor. At each step, some figures
will be given (technological choices, cycle
time, complexity), along with the design properties which
take full account of parallelism, concurrency and
asynchronism derived from the single assignment
and data-driven sequencing approach.


High level system principles
The single assignment approach 


Beside methods trying to discover parallelism 
in programs and others which transform sequential 
code into pseudo-parallel executable sequences, 
some radically different ways have emerged which 
start from the problem statement itself. The 
parallel program schemata and the data flow 
approach fall in this category. The single assign- 
ment rule is another concept [3, 4] that applies 
from the problem analysis to its execution on a 
parallel processor system. This rule states that 


an object may be assigned a value at most 
once during program execution 


Implications of single assignment are immedia- 
Le = 


1. a statement, say X =A + B, is "executable" 
as soon as its operands (A, B) are computed 


2% it may be executed at any time later, in a 
way totally independent from its location in 
the program. 


The parallelism expressed by the single assign- 
ment rule (S.A. rule) corresponds to the inherent 
parallelism as stated in the problem. Due to the 
S.A. rule, instructions are guaranteed to deal with 
the unique values of operands, which ensures program
determinacy. 


However, executable instructions may be per- 
formed in any order; this is a definite advantage 
for an implementation which will not be concerned 
with instruction sequencing at the hardware level. 
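
The two implications above can be illustrated with a small sketch. It is not the LAU implementation: the class and function names are hypothetical, and statements are modelled as (target, function, operands) triples. The point is only that, under the S.A. rule, picking ready statements in any order yields the same values.

```python
import random


class SingleAssignmentStore:
    """Objects may receive a value at most once (the S.A. rule)."""

    def __init__(self):
        self._values = {}

    def assign(self, name, value):
        if name in self._values:
            raise ValueError(f"{name} assigned twice")
        self._values[name] = value

    def has(self, name):
        return name in self._values

    def get(self, name):
        return self._values[name]


def run(statements, inputs, seed=None):
    """Execute statements data-driven, choosing among ready ones at random."""
    rng = random.Random(seed)
    store = SingleAssignmentStore()
    for name, value in inputs.items():
        store.assign(name, value)
    pending = list(statements)
    while pending:
        # A statement is executable as soon as all its operands are computed.
        ready = [s for s in pending if all(store.has(op) for op in s[2])]
        if not ready:
            raise RuntimeError("no executable statement: missing operands")
        stmt = rng.choice(ready)
        target, func, operands = stmt
        store.assign(target, func(*(store.get(op) for op in operands)))
        pending.remove(stmt)
    return store


program = [
    ("Y", lambda x, c: x * c, ("X", "C")),  # textually first, must wait for X
    ("X", lambda a, b: a + b, ("A", "B")),
    ("Z", lambda a: a - 1, ("A",)),
]
# Ten different execution orders, one single value for Y.
results = {run(program, {"A": 2, "B": 3, "C": 4}, seed=s).get("Y")
           for s in range(10)}
```

Whatever the seed, Y always evaluates to the same value, because X can only ever hold one value.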


LAU system level 0 : the general system (paper
made only)


An S.A. program may be viewed as a collection 
of tasks whose activations are data-driven : 
when input data are evaluated, a task supervisor 
may decide to route the task onto an idle processor. 
Outputs of the task allow the detection of the 
task termination and the activation of other tasks. 
The following software/hardware diagram describes 
the LAU system at this level : 


Figure 1 - Level 0 LAU system architecture
(blocks: secondary storage, task manager,
communication system; hardware and software layers)

Independent tasks may run concurrently on 
different processors; the task manager has only 
to keep track of the task execution by examining 
the data directed tree produced at compile- 
time. Notice that this scheme might be extended 
to a set of different jobs, whose task trees 
would be controlled by the task manager. A 
processor Pj, enters the active state when loaded 
by the data and instructions of a task. As data 
values are unique, the execution time is irrele- 
vant for determinacy. When task outputs are 
computed, the processor informs the task manager 
which stores data produced by the task, and 
checks for new tasks to be activated. This higher 
level of the LAU system has not yet been implemented
nor studied in full detail. Several open problems
remain to be evaluated : the expression of
problems as single assignment data flow tasks,
the compilation mechanisms that could break a
program into tasks that fit in the memory space
of a processor, and the workload on the communication
system and the task manager.


LAU system level 1 : the LAU processor
The high level language


From now on, we come to the real things 
with the software and hardware at the processor 
level. 


As for the software part of the system, a 
high level language has been defined. Statements 
in the language are syntactically classical, but 
semantically different from usual ones in sequen- 
tial languages : every statement is an assign- 
ment statement, and produces values for each of 
the objects to be computed within its scope. In
the following program : 


     S1   INPUT (INMATX);
     S2   N = 128
     S3   EXPAND/16 I = 1, N:
             LOCAL : TEMP1; TEMP2;
             TEMP1 = INMATX (I) - INMATX (I - 1)
             TEMP2 = INMATX (I) + INMATX (I - 1)
             MAT (I) = TEMP1 x TEMP2;
          END EXPAND;
     S4   MATS = VSUM (MAT FROM 1 TO N);
     S5   OUTPUT (MATS);


S1 is the unique statement assigning INMATX, 
while S3 is the only assignment statement of MAT. 
To be executable, S3 has to wait for the computation
of N and INMATX, i.e. the completion of S1 and S2.
Notice that S1 and S2 are "ready" instructions,
which will be generated by the compiler. The first
block will compute MAT (1), MAT (17), ..., MAT (113),
the sixteenth one MAT (16), MAT (32), ...,
MAT (128). Once activated, each block runs concurrently
with its 15 brothers, and parallelism
will range between 16 and 32. When S3 is terminated
(i.e. when MAT is computed), S4 will be activated,
and finally S5.
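
The distribution of indices over the sixteen blocks can be sketched as follows. The helper name is hypothetical and the interleaved assignment is inferred only from the MAT (1), MAT (17), ... enumeration given in the text, not from a definition of EXPAND.

```python
def expand_blocks(n, ways):
    """Return, for each of `ways` blocks, the 1-based loop indices it computes."""
    return [list(range(first, n + 1, ways)) for first in range(1, ways + 1)]


# EXPAND/16 I = 1, N with N = 128: sixteen blocks of eight iterations each.
blocks = expand_blocks(128, 16)
first_block = blocks[0]      # indices 1, 17, ..., 113
last_block = blocks[15]      # indices 16, 32, ..., 128
```

Within one iteration, TEMP1 and TEMP2 do not depend on each other, which is consistent with the stated parallelism of between 16 and 32.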


Open problems in the high level language
include some features that can be found in other
languages : very high level data structures and
the creation of user defined types. They were left
out because we wanted a compiler that would be
available as soon as possible and as efficient as
possible.


The LAU machine language 


It is a single assignment language, too, and
must embed the data flow control expressed at the
upper level:


- an instruction may enter its execution cycle as
soon as its operands are ready

- the result object is the only link between the
instruction computing it and those using it
as an operand.


Ready instructions should run independently 
and free from hardware constraints (other than 
those imposed by the technology at the logic gate 
level), such as the organization and the number 
of processing units, management of requests, 
register assignment and control, data organiza- 
tion in memory. All these problems are totally 
irrelevant in our high level approach. The 
definition of the machine language has led to the 
following formats 


- Data format :


[Data word diagram: LINK 1, LINK 2, Cd; value]


LINK 1 and LINK 2 refer to the addresses of
instructions using the data as an operand. Cd is a
control tag bit indicating whether the data has
been computed.


- Instruction format (for simple computational
instructions)


[Instruction word diagram: OPCODE field; control
tag bits C0, C1, C2; UPDATE]

Further details on the machine language 
can be found in [2,9] . These features 
imply the following comments : 


- the instruction format is large (64 + 2 bits)
but corresponds to approximately three classical
machine instructions (LOAD OP1, ADD OP2, STO RES).
An instruction will occupy one memory word. The
data format is larger than a classical one, which
does not contain instruction links.


- the instruction has a three-address format,
yielding two interesting properties for its execution
and the design of the machine : an instruction
may be executed in any one of the processing
units of the machine and, moreover, the
machine may be built with any number of processing
units, which can be increased, decreased, or
degraded without any program change at compile
time.


Thus, the machine language preserves the
parallelism expressed at the high level programming
stage, and does not introduce hardware
constraints for the definition of the machine.


The LAU processor architecture 


The LAU processor consists of three functio- 
nal boxes, interconnected by data and control 
busses as follows : 


Figure 2 - Functional diagram of an LAU
processor (blocks: host minicomputer holding LAU
programs and utilities (compiler, simulator, trace,
input/output); interface with front panel,
monitoring, I/O control and load/unload busses;
control subsystem; memory subsystem; execution
subsystem; request, instruction and data busses)


The Control subsystem is the truly original
part of the processor : the von Neumann program
counter is replaced by two memories, the instruction
control memory (ICM) and the data control
memory (DCM). ICM handles the three tag bits
C0, C1, C2 of the program instructions, while DCM
takes care of the Cd tag bit of program data.


The Memory Subsystem manages input requests 
and delivers memory operations to the other 
units. Special interest will be given to this 
box that may be the bottleneck of the processor. 


The Execution subsystem consists of N 
elementary processors connected on various busses. 
Each processor is totally independent from its 
neighbours, except for requests to external 
busses. 


We find it useful to explain the lower levels
of the LAU processor by following the trip of a machine
instruction throughout the different parts of
the system. The machine instruction (let's call
it MI) has the following initial assignment :


      C0  C1  C2
       1   0   1        RES   OPER   2


RES and OPER are data addresses;
2 is an immediate operand.


C0 C1 C2 are located in the Instruction Control
Memory at the same address as MI in main memory.
C0 has been set to 1 at compile time, and means
that MI is not nested in a control instruction.
C1 = 0 denotes that the object OPER is not yet
computed at this time, and C2 = 1 corresponds to
the constant 2 in the second operand field.


M I Magic Mystery Tour 


Let us come now to the micro-functional 
level of the design, and follow M I. 


Extracting M I address from the Control Unit 


The machine instruction, computing OPER, 
is being performed in some processing unit. It 
sends the Control Unit a signal indicating 
the computation done. 


The UPDATE module will set the Cl bit to 1 
at MI address. Now M I is virtually executable. 
In parallel with UPDATE, an Instruction Fetch
Processor (IFP) is permanently looking for "111" 
configurations in ICM (simulating a not yet 
available associative mechanism). IFP examines 
only the ICM portion where there is a chance to 
find out ready instructions, and it is capable 
of sending their addresses every 120 ns. Thus 
M I will eventually be checked as a ready ins- 
truction. All things happening elsewhere in the 
machine will never affect M I, due to the unique 
just computed value of OPER guaranteed not to 
change. IFP sends the M I address either to a FIFO
file buffer or directly to the Instruction
Output Port of the Control Unit, using a short
path, thus eliminating systematic buffering of
requests. The File is a 64 x 16 bit FIFO stack
and may deliver an instruction address every
120 ns. When full, meaning 64 instructions
waiting for execution, the file inhibits the
Instruction Fetch Processor. Notice that when
this occurs, IFP will not access ICM :
this speeds up the UPDATE device and,
consequently, the servicing of processing units,
which become idle sooner and thus accept
new instructions sooner, accelerating
requests for ready instructions located in the
file. Besides, the short path is possible
because ready instructions are fully independent
from each other, and ordering them is irrelevant.
This property allows a straightforward implementation
of the instruction fetch mechanisms. Here
parallelism is not disturbed by logic constraints
but only by accesses to the
Instruction Control Memory.


Figure 3 - Instruction Control Memory and
Instruction Fetch (blocks: ICM holding C0 C1 C2,
UPDATE serving "OPER computed" requests from the
processing units, Instruction Fetch Processor (IFP),
64 x 16 bit FIFO memory on the normal path, short
path, output port to the Memory Subsystem)


Reading M I in the Memory Subsystem


The Instruction Output Port of the Control Unit
is one of the three possible inputs to the
Memory Subsystem, organized as follows :


Figure 4 - Functional organization of
memory (blocks: input interfaces, memory banks,
output µ-Plex, instruction bus and data bus to the
execution system)


The input policy gives priority to the
execution subsystem (accesses to data operands
from processing units), which may send requests
every 60 ns. When there is a gap in this flow,
the MI address is processed by the Memory Input
Manager, whose job is to translate the request
into a standard memory request (one of 16) and
then to drive the request to the corresponding
Memory Bank; this takes 60 ns.
The activated Memory Bank stores the request
into a buffer of waiting operations. Here again
ordering is not relevant and a LIFO or FIFO
scheme may be adopted. Overflow in the buffer
is prevented by the following :


Figure 5 - Micro-functional description of
a Memory bank (blocks: standard request "Ins. Read
MI add" from the Input Plex, INREQ buffer built
from AMD 29705 (32 x 56 bit words), INHIBIT signal,
memory control, memory, MI instruction sent to the
output µ-Plex)


The Memory unit, which stores program
instructions and data, is organized in two independent
4 K x 32 bit memory blocks allowing read and
write operations on each block in one 480 ns
cycle.


Thus, MI is pushed on to the buffer, or 
directly conveyed to the request register of 
the Memory Unit. An instruction read operation 
is performed, the result stored into the Output 
Memory Latch 8 minor cycles after the cycle 
initiation. The Output Memory Manager looks for 
requests coming from the 8 Memory banks, and 
will take MI within the next 480 ns. This 
request is decoded, translated and sent on to
the appropriate output bus (here, the Instruction
Bus).


In the Memory subsystem, parallelism is
achieved by the interleaved organization of the
Memory : 8 requests can be completed every 480 ns,
giving a 60 ns minor cycle (and, for those
who like large numbers, 1 066 x 10^6 bits/second,
parity bits not included). Asynchronism can be
seen at the level of the acceptance of input
memory requests : read, write and instruction
read operations need not be ordered to guarantee
a correct program execution. However,
priority and clock synchronization are necessary
at the logic level, to deal with all requests
without loss of information.
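
The quoted figures can be checked with a few lines of arithmetic. The 64-bit word width is an assumption (two 32-bit blocks delivered together, parity excluded); it is the value that reproduces the bandwidth stated above.

```python
BANKS = 8             # interleaved memory banks
BANK_CYCLE_NS = 480   # read/write cycle of one bank
WORD_BITS = 64        # assumed useful word width (2 x 32 bits, no parity)

# One request completes per minor cycle when all banks are kept busy.
minor_cycle_ns = BANK_CYCLE_NS // BANKS

# Bits per nanosecond scaled to millions of bits per second (truncated).
mbits_per_second = WORD_BITS * 1000 // minor_cycle_ns
```

This gives a 60 ns minor cycle and 1 066 million bits per second, matching the text.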


M I enters the Execution Subsystem 


The Instruction Bus is connected to one of
the six controllers managing the execution
subsystem.


Figure 6 - Organization of the Execution
subsystem (blocks: instruction bus, data request,
data in and control controllers; Processor 1 to
Processor 24 (present machine), each with a control
unit of 1.5K x 64 bit micro-instructions and a 240 ns
micro-cycle, a 16 x 16 monitor memory and a 16-bit
ALU built from 4 Am 2901)


Any number of elementary processors may be
attached to the six busses whose functions are
given in the diagram. The instruction bus controller
is composed of an input file containing instructions
to be processed and, once more, a
short path permitting an instruction to be driven
directly onto the internal instruction bus. The
instruction controller checks for idle processors
and allocates the bus to the first one which
will accept MI; MI is stored into the Instruction
Register and execution starts. An
elementary processor is built from AMD 2900
series bit-slice microprocessors. This 16-bit
data processor has no local memory (other than
the 16 internal registers), but contains a special
memory used for monitoring and workload evaluation,
and runs microprograms corresponding to
the whole instruction set. An extensible micro-assembler
has been developed and is used to
generate 64-bit micro-instructions composed of
20 fields. Some of these fields are dedicated
to monitoring and spying operations that can be
performed in parallel, thus not affecting the
actual execution of instructions. Once in the
processor, M I is interpreted as follows :


- Decode CODOP (the operation code).

- Make a data operand access request, corresponding
to OPER, on the data out request bus.

- As MI uses an immediate value as second
operand, the constant 2 is prepared in a
register.

- When the request is performed (it may occur at
any time later, and depends on the load of
the data out controller, the Memory Input
Manager, the Memory Block concerned, etc.),
the result (i.e. the OPER value) comes back to
the Data In Request Controller, which signals it
to the processor on which M I is being performed.


Then operation proceeds : RES is computed, and 
special control primitives are performed : 


- Send a write-read request to memory at address
RES : the value of RES will be stored, and
the link fields will be brought back to the
processor.


- In parallel, send a write request to the Data Control
Memory : the Cd tag bit will be set to 1,
and the Data Control Unit will send back an
acknowledgement.


- Once the link fields are present in the processor,
send update requests to the Instruction
Control Memory. Each link field consists of an
instruction address using RES as an operand
and one bit indicating a left or a right operand.
The C1 or C2 bits will be set to one and
eventually will make new "111" configurations
in the ICM.
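
The interplay between link fields, tag bits and the search for "111" configurations can be sketched as below. This is an illustration under stated assumptions, not the hardware design: the class and method names are hypothetical, a link is modelled as an (instruction address, operand side) pair, and the `fired` set is an assumed device to avoid reporting the same instruction twice, since the text does not say how the machine does this.

```python
from collections import deque


class ControlUnit:
    def __init__(self, size):
        # One [C0, C1, C2] triple per instruction address (the ICM).
        self.icm = [[0, 0, 0] for _ in range(size)]
        self.fired = set()  # assumption: addresses already sent to execution

    def load(self, addr, c0, c1, c2):
        """Set the compile-time tag bits of the instruction at `addr`."""
        self.icm[addr] = [c0, c1, c2]

    def update(self, links):
        """Apply link fields: side 1 sets C1 (left), side 2 sets C2 (right)."""
        for addr, side in links:
            self.icm[addr][side] = 1

    def fetch_ready(self):
        """Scan for '111' configurations and return the new ready addresses."""
        ready = deque()
        for addr, tags in enumerate(self.icm):
            if tags == [1, 1, 1] and addr not in self.fired:
                self.fired.add(addr)
                ready.append(addr)
        return ready


cu = ControlUnit(4)
cu.load(0, 1, 0, 1)   # MI: waits for OPER, second operand is the constant 2
cu.load(1, 1, 0, 0)   # another instruction using OPER as its right operand
before = list(cu.fetch_ready())
cu.update([(0, 1), (1, 2)])   # link fields of OPER, applied once it is computed
after = list(cu.fetch_ready())
```

Before the update nothing is ready; afterwards only address 0 (MI) shows "111", while instruction 1 still waits for its left operand.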


There are other devices in the Data Control
Unit which run in parallel with the Update
processor. They implement the control primitives
(creation of data descriptors, descriptor checking,
single assignment verification) that we shall
not discuss here.


In the execution subsystem :


- processors are identical and independent from
each other. They perform instructions given
by the Instruction Bus controller in parallel,
and work asynchronously. They are managed by
the Controllers for their requests to the outside.
The number of processors can be extended;
though not yet much studied, it would be
interesting to discuss the capabilities of
fault tolerant or degraded processing in
this part of the system.


- pipelining of requests raises no problem and
may be short-circuited in some cases, thus
speeding up the actual throughput.


- within a processor, the microprogram may
send simultaneous requests to the output busses.
However, every request has to be acknowledged
before the processor may be considered as idle.

Simulation results showed that the major
problems lie in the memory subsystem, and not
in the execution subsystem. The prototype will
thus be built with 16-32 elementary processors,
with a 240 ns microinstruction cycle.


Its connection to a host system will allow
the use of peripheral devices. In particular,
a library of LAU programs, located on a disk,
will be managed by the interface unit, which
will also give facilities for tracing, monitoring
and data input/output.

Conclusion

The LAU system is an overall hardware/software
system which makes use of data-driven
sequencing principles at all levels :
a high level language (implemented) and a general
multi-processor architecture have been studied.
A machine language and a compiler producing
executable code from user's programs have been
defined, together with a data driven processor
structure. The processor is composed of any
number of processing units running in parallel,
and a special control unit implementing the
data driven mechanisms. Parallelism is securely
achieved by the independence of data flows in
the machine, due to the single assignment rule.
Use of pipelined mechanisms allows buffering
of requests when needed and a better throughput
in the system when saturation occurs in some
functional part of it. However, the pipeline
mechanisms can always be short-circuited in
case of low workload, without any trouble for
the determinism of computation. Asynchronism
is the main feature of the system : instruction
fetch, memory operations for any number of
processors, instruction execution and control
updating functions are achieved concurrently.
Only logic constraints may sometimes affect the
operation of the system. These characteristics
allowed us to concentrate on the actual problem
of implementation at the chip level and to have
a straightforward design period. The LAU system
level 1 is expected to be operational by the
end of 1978 and application programs will be
evaluated at that time.


Summary


This paper presents the design of a highly 
asynchronous distributed computer system which em- 
ploys a data flow approach for handling parallel- 
ism in programs. 


The proposed system is a hierarchically 
connected network of modules, each of which operates
in an asynchronous manner. The main components
of the system are a set of Computation
Activation Processor (CAP) Modules, each executing
an individually assigned procedure on the input 
data it receives. A Scheduler module regulates 
the sequencing and dispatching of all CAP proce- 
dures and data flow instruction packets. 


The data flow approach is based on direct 
initiation of each operation simply by the presence
of the required operand values. In a 1974
paper Dennis [1] first proposed a very basic ver- 
sion of a data flow language in which instruction 
execution was limited only by the data dependen- 
cies of the program. Dennis and Misunas [2] did 
preliminary work into the design of a computer 
based on this language. Rumbaugh [3] has 
expanded and improved the earlier version of the
data flow language proposed by Dennis and has de- 
veloped a multiprocessor architecture consisting 
of N identical Activation Processors. Each pro- 
cessor is capable of executing in a pipeline man- 
ner several data flow instructions at a time. 


The system we propose is a new, simpler im- 
plementation based on their work. We have parti- 
tioned our system into a number of asynchronously 
operating modules, each of which consists of a 
controlling processor and its associated memory 
structures. Our system consists of five major 
module types, namely an Interface Module, Assign- 
ment Module, Collection Module, Scheduler Module, 
and a set of N Computation Activation Processor 
(CAP) Modules. These five module types are inter- 
connected to produce a system in which the total 
functioning of the system is spread throughout the 
various modules, thereby realizing a distributed 
architecture system. Each module is only respon- 
sible for executing a specific, preassigned por- 
tion of the total system workload, with each mod- 
ule functioning in an independent manner. This 
approach provides a far superior method of system 
functioning than most contemporary systems since 
the distributed architecture concentrates the con- 
currency problems inherent in parallel computer 
systems in the module interfaces. Our module
interconnection is accomplished through the use of a
set of similarly functioning bus-to-queue
interfaces. These interfaces reduce the module
interaction to the problem of putting items in and
removing items from the queues in a non-conflicting
manner.


*This work was supported in part through the Post


The Interface Module accepts programs, pro- 
cedures, and data input from the outside world in 
a high level language for later entry into the 
main system. 


The Scheduler Module has three main areas of 
responsibility: 1) accepting the transformed com- 
pilation structures from the Interface Module, 

2) dispatching completed operand packets to the 
Assignment Module, and 3) retrieving completed re- 
sult packets from the Collection Module. 


Each Computation Activation Processor Module 
contains a processor capable of performing a speci- 
fic operation stored in its procedure store. Op- 
erand packets are removed one at a time from the 
CAP's operand packet queue, placed in its working 
store, and then the assigned procedure is performed 
on the packet. After the computation is completed, 
a result packet is formed and placed in the CAP's
results packet queue where it awaits collection. 
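
A CAP module as described can be sketched in a few lines. The class name, the packet layout (a tag plus a tuple of operands) and the `step` method are assumptions made for illustration; the summary does not specify packet formats.

```python
from collections import deque


class CAPModule:
    def __init__(self, procedure):
        self.procedure = procedure      # contents of the procedure store
        self.operand_queue = deque()    # filled from outside the module
        self.result_queue = deque()     # awaits collection

    def step(self):
        """Process one operand packet; return False if none was waiting."""
        if not self.operand_queue:
            return False
        # Remove one packet and hold it locally (the working store).
        tag, operands = self.operand_queue.popleft()
        # Apply the assigned procedure and form a result packet.
        self.result_queue.append((tag, self.procedure(*operands)))
        return True


adder = CAPModule(lambda a, b: a + b)
adder.operand_queue.extend([("n1", (1, 2)), ("n2", (10, 20))])
while adder.step():
    pass
collected = list(adder.result_queue)
```

Because the module only touches its own two queues, its interaction with the rest of the system reduces to putting packets in and taking packets out, as the summary argues.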


The distribution of the processors and 
memory into modules, which perform individual 
portions of the system's total functioning, 
reduces the problems of memory contention and 
processor synchronization. The consistent way 
in which the interfacing and synchronization of 
the various modules is handled helps to simplify 
the problems involved in determining system 
failures. Since the system design permits us 
to isolate faults to one or more modules or 
interfaces, we are easily able to correct the 
problem by simply replacing the faulty module. 


As a final comment, we believe that our 
system design will allow us to implement many 
hardware aspects of the system with the use of 
low cost, currently available hardware. The 
individual processors may be realized using 
microprocessors, and the various memory structures
are easily implemented using LSI techniques.


Summary


It is difficult to take full advantage of the 
parallelism in a problem when designing a computer 
containing many processors. Consequently, several 
new types of parallel computer architecture have 
been proposed with the objective of allowing 
efficient exploitation of problem parallelism. The 
class of configurable computers [1], including data
flow machines (e.g. [2,3]), has evoked considerable
interest because of its general applicability and
inherently parallel nature. More recently, the
possibility of interconnecting very large numbers
of microprocessors has been suggested, and data
flow configurations have been proposed in this
vein [4]. This paper outlines a basic ring- 
structured data flow architecture, and proposes an 
extended version (using connected, multiple layers 
of simple rings) in which an arbitrarily large 
number of processing units may operate. 


The basis of all data flow systems is a 
graphical computational schema (e.g.[3-7]). The 
schema used as a base for the ring-structured 
system is described in detail in reference [8]. It
differs from other data flow schemata in that
tokens on arcs need not be maintained in first-in-
first-out order. As a consequence, all tokens in
a re-entrant graph must be labelled to ensure that
they are distinct. A label comprises three fields
which distinguish three sources of reentrancy,
namely: (i) parallel data structure (array); (ii)
iteration (sequence); and (iii) recursion.
Specialised nodes can adjust all or part of a
label when entering or leaving a reentrant section
of a graph. The schema also restricts the maximum
number of input (and output) arcs at a node to 2.


The single ring architecture contains a
processing area and an assembler together with
several stores.

The instruction store contains a (read-only)
linear representation of the computational graph :
each instruction defines a nodal function and up
to two possible destination instructions for the
results (i.e. the output arcs).

The matching store is associatively accessed
and is used to hold those results (i.e. tokens)
which cannot proceed to the next instruction
because the necessary firing condition has not
been met (i.e. because a second input result is
not yet available). The name used for association
consists of the destination of the token together
with its label. This uniquely identifies every
result in the system provided that labels are
distinct.

The result queue acts as a buffer for named
results between their leaving the processing area
and their being dealt with by the assembler.
Although called a queue, this store need not be
strictly first-in-first-out.

The assembler takes results one-at-a-time
from the head of the result queue and tries to
form for each an executable instruction by
accessing the instruction store (to find the
next function and destinations) and (if necessary)
the matching store (to find a matching input
result for a two input function). If the last
action is unsuccessful, the incoming result is
saved in the matching store to await its partner :
otherwise the ensuing executable instruction
(i.e. function, two operands, common label and
destinations) is sent to the processing area to be
executed.
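
The assembler's matching step can be sketched as follows. Everything here is illustrative: the instruction store contents, the tuple layouts and the function name are hypothetical, labels are reduced to a single integer, and the distinction between left and right operands is ignored (harmless only for commutative functions).

```python
from collections import deque

# Hypothetical instruction store: address -> (function, arity, destinations).
INSTRUCTIONS = {
    "add": (lambda a, b: a + b, 2, ("neg",)),
    "neg": (lambda a: -a, 1, ()),
}


def assemble(result_queue, matching_store):
    """Turn queued results into executable instructions where possible."""
    executable = []
    while result_queue:
        dest, label, value = result_queue.popleft()
        func, arity, dests = INSTRUCTIONS[dest]
        if arity == 1:
            executable.append((func, (value,), label, dests))
            continue
        name = (dest, label)   # name = destination together with label
        if name in matching_store:
            partner = matching_store.pop(name)
            executable.append((func, (partner, value), label, dests))
        else:
            matching_store[name] = value   # wait for the partner result
    return executable


queue = deque([("add", 0, 3), ("add", 1, 5), ("add", 0, 4)])
store = {}
ready = assemble(queue, store)
values = [func(*operands) for func, operands, _label, _dests in ready]
```

The two results labelled 0 pair up into one executable instruction, while the result labelled 1 stays in the matching store, which shows why distinct labels are needed in a re-entrant graph.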

An input/output switch at the output of the 
processing area enables results to be transmitted 
into or out of the ring. 


In the multilayered architecture, the input/ 
output switch is extended to become a switching 
network connecting the outputs of the processing
areas to the inputs of the result queues in the 
system. The switching network resembles a sorting 
network [9] which directs each result according to 
some part of its name. 

Both switching network and individual rings 
can be constructed as pipelines : any number of 
rings may be connected without reducing the 
pipeline beat period.


Abstract -- The application of one-dimensional
bin-packing to the problems of efficient allocation
of parallel resources is studied. Two
classes of fast approximation algorithms are considered
for these problems, which are known to be
NP-complete. Known bounds on performance are
briefly surveyed; some new results are then derived
for broad classes of problems for which the
approximate algorithms are in fact optimal.


I. Introduction


In the implementation of parallel processing, 
one encounters the sequencing of tasks on multiple 
processors and the allocation of information to 
parallel storage units as two common design de- 
cisions for which specific problems are defined 
and efficient solutions sought. With the nature 
of the units being assigned, allocated or sequenced
being known, these problems are normally
instances of general bin-packing problems. In the
sequel we shall define the basic model of bin- 
packing, couch the above sequencing and allocation 
problems in these terms, and then explore two 
classes of approximation algorithms that have been 
applied to these problems. In this study of the 
so-called first-fit and level algorithms we shall 
examine questions of optimality and performance 
bounds relative to optimization algorithms. It 
will be seen that the first-fit algorithms appear 
more promising in the applications considered. 


II. One-Dimensional Bin-Packing 


Informally, we are given a collection of m
"bins" $B_1, \ldots, B_m$ of equal capacity, c, and a
set P of n "pieces" $p_1, \ldots, p_n$ each of which
has a size not exceeding the bin capacity. The
piece sizes as well as names will be given by $p_i$,
$1 \le i \le n$. The object of any bin-packing problem
is to pack the pieces into the bins so as to 
optimize some given measure of the packing. In 
applying this model to processor sequencing and 
storage allocation we have the following term 
associations (the first term applies to sequencing/ 
scheduling while the second applies to storage 
allocation): 


Bin: processor, storage unit (e.g. cylinder) 
Piece: task, record 

Packing: schedule, allocation 

Capacity: deadline, capacity 


We consider the following problems. 


Joseph Y-T. Leung, Dept. of Computer Science,
Virginia Polytechnic Institute and State University
Donald Slutz, IBM Research, San Jose, Calif.


P1. Assume m is as large as necessary (m = n
will always suffice), and a fixed capacity. The
object is to minimize the number N of bins
required to pack P.


This problem as well as those below have 
many applications outside of computer science. In 
computer operation P1 concerns the problems of (1)
minimizing the number of units (cylinders, pages, 
etc.) necessary for the storage of a collection 
of variable size records, and (2) minimizing the 
number of processors needed to complete all tasks 
by a given deadline common to all tasks. 


P2. Assume m is fixed. The problem is to mini- 
mize c such that all pieces can be packed. 


P2 is the classical problem of sequencing 
to minimize make-span, or the design problem of 
finding a capacity such that all records can be 
placed in a fixed set of equal capacity storage 
units. 


P3. With m and c fixed maximize the number n 
of pieces (i.e. the subset of P) packed in the 
bins. 


P3 is clearly the problem of maximizing the 
number of tasks finished by some deadline (c), or 
the number of records stored in a fixed collection 
of storage units. 


P4. With m fixed and c unconstrained pack the
pieces so as to minimize $\sum_{i=1}^{m} L_i^2$,
where $L_i$ is the level* of $B_i$.


P4 has arisen in storage allocation applica- 
tions in which the object is to minimize average 
access times. In these problems the records are 
of equal size; the "piece sizes'’ correspond to 
stationary record access probabilities, 


Although P1-P4 by no means exhaust all pro- 
blems of the bin-packing type they are the princi- 
pal ones to which the simple approximation al- 
gorithms that we discuss have been successfully 
applied. For general parameter values each of 
P1-P4 is NP-complete; this fact has been a prime 
motivation for studying fast heuristics. Those 
having received most attention fall into two 
classes, heuristics in both classes assigning 
pieces one at a time as they are drawn in sequence 


from a given list L.

*The level of $B_i$ is the sum of the piece sizes in $B_i$.


III. Level Algorithms


These algorithms are distinguished by the 
assignment criterion: The next piece to be assign- 
ed is packed into the bin currently of lowest 
level. The largest-first (LF) algorithm is a lev- 
el algorithm that initially puts L into non- 
increasing order (as with the first-fit algorithms 
below, it is only the assumed ordering of L that 
distinguishes two different level algorithms). 

The LF rule has been applied to P2 and P4; it is 
illustrated below for P2. 


L = (1/2, 2/5, 3/8, 1/3, 5/16, 5/16, 1/4, 1/5, 1/6, 3/20)
m = 3

47/48   17/16   23/24    levels
        3/20
1/6     1/5     1/4
5/16    5/16    1/3
1/2     2/5     3/8
B1      B2      B3

LF Rule Applied to P2 (c = 17/16)

1       1       1        levels
                3/20
1/6     5/16    1/4
1/3     5/16    1/5
1/2     3/8     2/5
B1      B2      B3

An Optimum Packing (c = 1)
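The LF rule illustrated above can be sketched in a few lines. This is only a minimal illustration, not the authors' code: the function name `largest_first` is hypothetical, ties between equally low bins are broken in favor of the lowest index (an assumption), and exact rational arithmetic is used so that the bin levels can be compared with the figure.

```python
from fractions import Fraction as F

def largest_first(pieces, m):
    """Level algorithm with the list in non-increasing order (LF rule)."""
    bins = [[] for _ in range(m)]
    levels = [F(0)] * m
    for p in sorted(pieces, reverse=True):
        # Assign the next piece to the bin currently of lowest level
        # (ties go to the lowest bin index, an assumption of this sketch).
        i = levels.index(min(levels))
        bins[i].append(p)
        levels[i] += p
    return bins, levels

L = [F(1, 2), F(2, 5), F(3, 8), F(1, 3), F(5, 16), F(5, 16),
     F(1, 4), F(1, 5), F(1, 6), F(3, 20)]
bins, levels = largest_first(L, 3)
print(levels)       # bin levels of B1, B2, B3
print(max(levels))  # capacity c required by the LF packing
```

For P2 the quantity of interest is `max(levels)`, the least capacity that accommodates the LF packing.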


Let us define the performance ratio for an 
algorithm and a given m as the worst-case ratio of 
the performance of the given algorithm to that of 
an optimization algorithm. The asymptotic
performance ratio is the limiting value as m → ∞.
For the cases at hand performance means the
maximum bin occupancy (least capacity necessary for an
LF packing) for P2, and the sum of the squares of
the bin levels for P4. It has been shown that the
LF performance ratio is 4/3 - 1/(3m) for P2 and is
bounded by 25/24 for P4. It is not known whether
the latter is achievable, but this is conjectured
not to be the case.


The shortest-first level algorithm has been 
applied to P3, but it has a performance ratio of 
m/(2m+1), which is relatively poor as we shall see
later. LF level algorithms can also be applied to
P1 and P3, but their performance ratios are easily
shown to be inferior to those of the corresponding 
first-fit algorithms described below. 


IV. First-Fit Algorithms

The first-fit (FF) algorithms are more explicitly
goal oriented, in contrast to the level
algorithms. With these algorithms, the bins are
scanned in the order $B_1, B_2, \dots$ until a bin is
encountered which will accommodate the next piece
to be packed. An effective FF algorithm is the
first-fit-decreasing (FFD) algorithm which, like
the LF algorithm, initially puts the list into
non-increasing order. The FFI ("I" for "increasing")
algorithm initially puts the list in a
non-decreasing order. The FFD rule is illustrated
below for P1.


L = (1/2, 2/5, 3/8, 1/3, 5/16, 5/16, 1/4, 1/5, 1/6, 3/20)
c = 1

1/10    1/24    1/120   17/20    unused capacity
                1/6
        1/4     1/5
2/5     1/3     5/16
1/2     3/8     5/16    3/20
B1      B2      B3      B4

FFD Rule Applied to P1 (4 bins)

0       0       0        unused capacity
                3/20
1/6     5/16    1/4
1/3     5/16    1/5
1/2     3/8     2/5
B1      B2      B3

An Optimum Packing (3 bins)
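The FFD rule for P1 can be sketched as follows. This is a hedged illustration under the stated description only: the function name `first_fit_decreasing` is hypothetical, and exact fractions are used so that the unused capacities can be compared with the figure.

```python
from fractions import Fraction as F

def first_fit_decreasing(pieces, c):
    """Pack pieces into bins of capacity c by the FFD rule."""
    bins = []
    for p in sorted(pieces, reverse=True):
        # Scan B1, B2, ... for the first bin that accommodates p.
        for b in bins:
            if sum(b) + p <= c:
                b.append(p)
                break
        else:
            bins.append([p])  # no bin fits: open a new one
    return bins

L = [F(1, 2), F(2, 5), F(3, 8), F(1, 3), F(5, 16), F(5, 16),
     F(1, 4), F(1, 5), F(1, 6), F(3, 20)]
bins = first_fit_decreasing(L, F(1))
unused = [1 - sum(b) for b in bins]
print(len(bins), unused)
```

Since the piece sizes sum to exactly 3, any packing into three unit bins must leave no unused capacity, which is why FFD's early commitment of 1/2 and 2/5 to the same bin costs it a fourth bin here.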


Variations of the FF algorithms are the best- 
fit (BF) algorithms, in which the bins are scanned 
as before, but with the result that a piece is 
placed in that bin of lowest index for which the 
resulting unused capacity is least. 


The FF, BF, FFD, and BFD algorithms have been
applied to P1; the asymptotic performance ratios
have been shown to be 17/10, 17/10, 11/9, and
11/9, respectively.


The FFI algorithm has a performance ratio of
3/4 when applied to P3. Improved performance for
P3 is obtainable from an iterated FFD algorithm
which works as follows. L, assumed to be in
non-decreasing order, is scanned until a largest r is
found such that $\sum_{i=1}^{r} p_i \le mc$, where mc
represents the total capacity. The list
$p_1 \le p_2 \le \dots \le p_r$
is then scanned in reverse order and packed
according to the FFD rule. If the rule fails to pack
all pieces then the largest piece ($p_r$) is
discarded and the process repeated on $p_1, \dots, p_{r-1}$.
Largest pieces are discarded iteratively until a
packing of all remaining pieces succeeds. The
asymptotic performance ratio of this rule is known
to be in the interval (6/7, 7/8].
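The iterated FFD rule for P3 described above can be sketched as follows. The helper names `ffd_within` and `iterated_ffd_p3` are hypothetical, and the sketch simply follows the verbal description (prefix selection, FFD attempt, discard the largest piece, repeat); it is not taken from the original paper.

```python
from fractions import Fraction as F

def ffd_within(pieces, m, c):
    """FFD packing into at most m bins of capacity c, or None if it fails."""
    bins = []
    for p in sorted(pieces, reverse=True):
        for b in bins:
            if sum(b) + p <= c:
                b.append(p)
                break
        else:
            if len(bins) == m or p > c:
                return None
            bins.append([p])
    return bins

def iterated_ffd_p3(pieces, m, c):
    """Pack as many pieces as the iterated FFD rule manages for P3."""
    ordered = sorted(pieces)              # non-decreasing order
    total, r = 0, 0
    for p in ordered:                     # largest r with p_1+...+p_r <= m*c
        if total + p > m * c:
            break
        total += p
        r += 1
    while r > 0:
        packing = ffd_within(ordered[:r], m, c)
        if packing is not None:
            return packing
        r -= 1                            # discard the largest piece, retry
    return []

L = [F(1, 2), F(2, 5), F(3, 8), F(1, 3), F(5, 16), F(5, 16),
     F(1, 4), F(1, 5), F(1, 6), F(3, 20)]
packing = iterated_ffd_p3(L, 3, F(1))
print(sum(len(b) for b in packing), "pieces packed")
```

On this list with m = 3 and c = 1 the first attempt (all ten pieces) fails, and the rule succeeds after discarding the piece of size 1/2, although all ten pieces can be packed by an optimum algorithm.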


The same concept has been exploited in apply- 
ing the FFD rule to P2. In this case, the FFD 
rule is iterated on the entire list in a (binary) 
search for a c in a bounded range. The result 


is an algorithm whose performance ratio is known
to be in the interval [20/17, 61/50) for m ≥ 8;
for m = 2, 3 and 4 ≤ m ≤ 7 the performance ratios
are precisely 8/7, 15/13, and 20/17, respectively.
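A minimal sketch of the iterated FFD idea for P2 follows. The search range is an assumption of this sketch rather than something stated in the text: the lower end is the trivial bound max(largest piece, total/m), and the upper end max(largest piece, 2·total/m) is used because FFD cannot need more than m bins at that capacity (any two bins of a first-fit packing together hold more than c). The names `ffd` and `iterated_ffd_p2` and the fixed number of halving rounds are likewise hypothetical.

```python
from fractions import Fraction as F

def ffd(pieces, c):
    """First-fit-decreasing packing into bins of capacity c."""
    bins = []
    for p in sorted(pieces, reverse=True):
        for b in bins:
            if sum(b) + p <= c:
                b.append(p)
                break
        else:
            bins.append([p])
    return bins

def iterated_ffd_p2(pieces, m, rounds=20):
    """Binary search for a small capacity c at which FFD uses at most m bins."""
    total, largest = sum(pieces), max(pieces)
    lo = max(largest, total / m)       # no packing can use a smaller c
    hi = max(largest, 2 * total / m)   # FFD fits into m bins here (assumed bound)
    best = ffd(pieces, hi)
    assert len(best) <= m
    for _ in range(rounds):
        mid = (lo + hi) / 2
        trial = ffd(pieces, mid)
        if len(trial) <= m:
            hi, best = mid, trial      # feasible: try a smaller capacity
        else:
            lo = mid                   # infeasible: a larger capacity is needed
    return hi, best

L = [F(1, 2), F(2, 5), F(3, 8), F(1, 3), F(5, 16), F(5, 16),
     F(1, 4), F(1, 5), F(1, 6), F(3, 20)]
c, packing = iterated_ffd_p2(L, 3)
print(float(c), max(sum(b) for b in packing))
```

The returned capacity is always one at which FFD succeeds, so the packing is valid even though FFD feasibility is not monotone in c in general.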


The FF rules have yet to be applied to P4, 
although an algorithm based on the FFD rule appears 
to have superior performance to the LF level 
algorithm.


V. Optimality Tests 


There are a number of important cases when 
the simple heuristics we have discussed are in 
fact optimal or asymptotically so. For a given
positive real, a, let S(a) denote the set of
positive powers of a; i.e. $\{a, a^2, \dots\}$.


Theorem 1*


If for some a, $p_i \in S(a)$ for all i, then the
LF and iterated FFD rules are optimal for P2.
Moreover, the FFD rule is optimal for P1.


This result verifies optimality, in particular 
reference to computer applications, when all task 
execution times, or perhaps more appropriately all 
record sizes, are powers of two. In addition, this 
is true for P2 even when an arbitrary level al- 
gorithm can still do fairly badly, as shown in the 
following generalization of a result due to 
Graham [3]. 


Theorem 2 


Suppose the piece sizes can be normalized so
that for all i, $p_i$ is an integer. Then an
arbitrary level algorithm for P2 can perform worse
than optimal by a factor of up to
$2 - 1/\min\{m, \max_i\{p_i\}\}$.

A result somewhat stronger than Theorem 1 can 
be proved for the FFD rule when applied to Pl. 


Theorem 3 

Suppose $p_i$ divides c for all i. Then for
P1 an FFD packing never requires in excess of one
bin more than an optimum packing.

A test for optimality existing with the LF, 
but not the iterated FFD, rule applied to P2 is 
given next. 

Theorem 4 
Let $w_L$ be the maximum bin occupancy on
applying the LF rule to an instance of P2 in m
bins. If

$$w_L > \left(\frac{3}{2} - \frac{1}{2m}\right) \sum_{i=1}^{n} p_i / m$$

then LF is optimal.


*Results in this section are proved in the 
appendix. 
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This result states that if the LF rule is 
forced to do "too badly" with respect to the
absolute minimum $\sum_{i=1}^{n} p_i / m$, then it must be
optimum. For example, if the LF rule ever does 50%
worse than the absolute minimum, then it must be
optimal.


These are useful tests for optimality which 
allow one to avoid expensive enumerative or 
approximate approaches when solutions very close 
to the optimum are always needed. It is obviously 
desirable to derive conditions that are necessary 
as well as sufficient; however, the nature of the 
problem appears to make this a very difficult 
question.


VI. Open Problems 


Worst-case performance bounds have obvious 
shortcomings as criteria for algorithm selection. 
This problem is aggravated by the fact that level 
and first-fit algorithms are seldom comparable in 
the sense that one always does at least as well 
as the other. Indeed, worst-case examples for 
one are frequently handled optimally by the other. 
Thus, although worst-case bounds may point to one 
algorithm, there is no assurance that that al- 
gorithm is statistically best. 


Worst-case examples are somewhat pathological 
in some cases, and rely, for example, on large 
variations in piece sizes; thus, it would be of 
considerable interest to parameterize bounds in 
terms of this variation. Further results con- 
cerning instances when certain algorithms are 
optimal would also appear possible. Simple pro- 
bability models appear to be of greatest interest; 
with such models one can hope to quantify such 
statements as: "algorithm performance will be
within x percent of optimum with a probability 
bounded by p". 


The FFD algorithms exhibit anomalies in many 
cases. These occur in their application to P1,
P2 and P3. For example, there are lists for P1
such that if a piece is removed, the number of 
bins required to pack (by the FFD rule) the re- 
maining pieces becomes one greater. It would be 
helpful to obtain some usable estimate of the 
maximum effects of such anomalies. 
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Appendix 


Proof of Theorem 1. 


Consider the LF rule applied to P2 on m
machines under the assumptions that
$p_1 \ge p_2 \ge \dots \ge p_n$
and each $p_i \in P$ is a positive power of some
positive number a. (For this part of Theorem 1 we
shall use the terminology of task scheduling.)
Suppose, contrary to the theorem, that the LF rule
is not optimal for P, and suppose that P is a
smallest counterexample. Thus, if $w_L$ and $w_{opt}$
are the respective lengths of LF and optimum
schedules for P, then $w_L > w_{opt}$.


Suppose $p_n$, the smallest task, does not
terminate the LF schedule; i.e. suppose $p_n$ does
not have a latest finishing time in the LF schedule.
Then we note that the LF schedule length for
$P - \{p_n\}$ must be equal to $w_L$ while the length of an
optimum schedule for $P - \{p_n\}$ can not be greater
than $w_{opt}$. Thus, $P - \{p_n\}$ provides us with a
smaller counterexample, and this contradicts the
assumed minimality of P. Hence, we may assume
that $p_n$ terminates the LF schedule.


Finally, observe that $p_n$ divides $p_i$ for all i.
It follows easily from the LF rule and the
observations to this point that

$$w_L \le \left\lceil \sum_{i=1}^{n} p_i / m \right\rceil_{p_n},$$

where $\lceil y \rceil_x$ denotes the least multiple of x
greater than or equal to y. But
$\lceil \sum_{i=1}^{n} p_i / m \rceil_{p_n}$ is
a lower bound to the length of any schedule for P
and hence the LF schedule is optimal.


Consider now the FFD rule applied to P1
under the above assumptions. Assuming as before
that P is a smallest counterexample implies
that $p_n$ uniquely occupies the last bin. Since $p_n$
divides all $p_i$, this in turn implies that every
bin, except the last one, must be filled to a
level of $\lfloor c \rfloor_{p_n}$, the greatest multiple of
$p_n$ not exceeding c. Since no packing can fill a bin
to a higher level, the FFD rule must be optimal.


With reasoning similar to the above it is
easily shown that the iterated FFD rule is also
optimal for P2 under the assumptions of the
theorem.


Proof of Theorem 2 


Suppose that each $p_i$ is an integer. We
want to consider the largest possible ratio
between two schedule lengths arising from two
different level algorithms applied to P2 on m machines.
(We shall use the terminology of task scheduling.)
First, we dispose of the case when $p_1 = \max\{p_i\} < m$.


Since all tasks are integral and each task
is assigned in its turn to the first available
processor, we have for a maximum schedule length

$$w' \le p_1 + \left\lfloor \sum_{i=2}^{n} p_i / m \right\rfloor .$$

For a minimum schedule length we have the lower
bound $w \ge \sum_{i=1}^{n} p_i / m$, and hence

$$\sum_{i=2}^{n} p_i / m = \sum_{i=1}^{n} p_i / m - p_1/m \le w - p_1/m .$$

But w must be integral, and since $p_1/m < 1$ we
have

$$\left\lfloor \sum_{i=2}^{n} p_i / m \right\rfloor \le \lfloor w - p_1/m \rfloor = w - 1$$

or

$$\frac{\left\lfloor \sum_{i=2}^{n} p_i / m \right\rfloor + 1}{w} \le 1 .$$

Hence,

$$\frac{w'}{w} \le \frac{p_1}{w} + \frac{\left\lfloor \sum_{i=2}^{n} p_i / m \right\rfloor}{w} \le \frac{p_1}{w} + 1 - \frac{1}{w} .$$

Since $w \ge p_1$ we get

$$\frac{w'}{w} \le 2 - \frac{1}{p_1} .$$

As shown in [3], 2 - 1/m is a general bound
on w'/w; in particular, the bound holds when
$p_1 \ge m$. Thus,

$$\frac{w'}{w} \le 2 - 1/\min[m, \max\{p_i\}] .$$

An example showing that the bound is achievable is
provided by the parameters n = 2m-1, $p_1 = k$,
$p_i = k-1$ (2 ≤ i ≤ m),
and $p_i = 1$ (m+1 ≤ i ≤ 2m-1) for any integer k ≤ m. []
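The achievability example can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, the value of the first task is taken to be k (an assumption, since that part of the parameter list is illegible in the source), and the "bad" list order (tasks of size k-1 first, then the unit tasks, then the largest task) is one ordering chosen here to exhibit the bound.

```python
from fractions import Fraction

def level_schedule_length(tasks, m):
    """Assign each task, in list order, to the processor that frees first."""
    finish = [0] * m
    for p in tasks:
        i = finish.index(min(finish))
        finish[i] += p
    return max(finish)

def worst_case_instance(m, k):
    """n = 2m-1 tasks: one of size k (assumed), m-1 of size k-1, m-1 of size 1."""
    return [k] + [k - 1] * (m - 1) + [1] * (m - 1)

m, k = 4, 3
tasks = worst_case_instance(m, k)
bad_order = tasks[1:] + tasks[:1]               # largest task placed last
w_bad = level_schedule_length(bad_order, m)
w_lf = level_schedule_length(sorted(tasks, reverse=True), m)
lower_bound = max(max(tasks), -(-sum(tasks) // m))  # max(p_max, ceil(sum/m))
print(w_bad, w_lf, Fraction(w_bad, w_lf))
```

Because the LF schedule length equals the lower bound here, it is optimal for this instance, and the ratio of the two level schedules equals 2 - 1/k.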
Proof of Theorem 3 


For the FFD rule applied to P1, we assume
that $p_i$ divides c for all i. For simplicity we
may assume c = 1, and hence that the $p_i$ are all
unit fractions. Assume the FFD packing requires
m bins of which j ≤ m have a non-zero unused
capacity. Let 1/k be the size of the largest piece in
the last bin, $B_m$. Thus, at least j-1 of the
incompletely filled bins must be filled to a level
exceeding (k-1)/k. Consequently, since 1/k is a
lower bound to the level of the last bin, we have

$$\sum_{i=1}^{n} p_i > (m-j) + (j-1)(k-1)/k + 1/k$$

or

$$\sum_{i=1}^{n} p_i > m - j/k - (k-2)/k .$$

Clearly, each incompletely filled bin, except
possibly the last, must have at least two
different piece sizes in it. Therefore, since the
pieces are packed in a non-increasing order, we have
j ≤ k and

$$\sum_{i=1}^{n} p_i > m - 1 - (k-2)/k > m - 2 .$$

Finally, if $N_{opt}$ is the optimum number of bins,
we have the obvious lower bound
$N_{opt} \ge \lceil \sum_{i=1}^{n} p_i \rceil$.
Making use of the previous inequality, we get
$N_{opt} \ge m-1$; i.e. an optimum packing requires at
most one fewer bin than the FFD packing. []
Proof of Theorem 4 


Using scheduling terminology we shall show
that for P2 the LF rule is optimal if in the
resulting schedule for P we have

$$w_L > \left(\frac{3}{2} - \frac{1}{2m}\right) \sum_{i=1}^{n} p_i / m .$$

First, note that the starting time of any task
must not exceed any processor finishing time.
Thus, if P is assumed to be a smallest
counterexample for the theorem, then we have as in the
proof of Theorem 1 that the smallest task, say $p_n$,
terminates the schedule. Hence,

$$w_L \le p_n + \sum_{i=1}^{n-1} p_i / m = p_n\left(1 - \frac{1}{m}\right) + \sum_{i=1}^{n} p_i / m .$$

Using the above two inequalities we obtain

$$p_n \frac{m-1}{m} + \sum_{i=1}^{n} p_i / m > \left(\frac{3}{2} - \frac{1}{2m}\right) \sum_{i=1}^{n} p_i / m$$

which reduces to

$$p_n > \frac{1}{2} \sum_{i=1}^{n} p_i / m .$$

But this means that there are at most two tasks
per processor in the LF schedule. (If a processor
had three or more tasks, then conservation of
$\sum_{i=1}^{n} p_i$ would require that the finishing time of
some other processor precede the starting time
of the third task scheduled.) Finally, it is
routine to verify that an LF schedule with at
most two tasks per processor is an optimum
schedule. This contradiction proves the
theorem. []
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Abstract-- Multiprocessor scheduling strategies 
have been the focus of substantial research in 
recent years. The complexity of the general 
scheduling problem necessitates an abstract mathe- 
matical model to analyze scheduling algorithms 
with respect to their worst-case performance 
bounds. This paper studies the task scheduling
problem for a Task Dependency Structure which
differs from the general precedence relation on
tasks only by one restriction: All immediate
successor tasks of a branch task must not be
merge tasks. In particular, it is shown that for
m > 2 processors the ratio of the length of an
arbitrary level schedule for N-free task
structures and of the corresponding optimal schedule is
bounded by 3/2. Furthermore, if m = 2, then a
level schedule is always optimal for N-free task
structures.


I. Introduction 


Multiprocessor scheduling strategies have been 
the focus of substantial research in recent years. 
The complexity of the general scheduling problem 
necessitates an abstract mathematical model to 
render an analysis of scheduling algorithms with 


respect to certain performance goals. 


One of the most common models is the so-called
'General Multiprocessor System' proposed by
Graham [1]. This model is defined by the following
components:

1) A set of m identical processors $P_i$,
   i = 1, ..., m.
2) A set of tasks $T = \{t_1, \dots, t_n\}$ which is to
   be processed by the $P_i$.
3) The general task dependency structure < on T
   which is a binary relation that is
   anti-symmetric and transitive.
4) A function $\mu: T \to (0, \infty)$ which denotes the
   execution time of each task $t \in T$.


The performance goal to be considered here is to 
find scheduling algorithms which minimize the 
total time required to execute the task set T 


on the processors $P_i$.


Ullman [4] has shown that this problem is
polynomial complete even if one of the following
restrictions holds:

1. All tasks $t \in T$ require one unit of time for
   their execution.
2. All tasks $t \in T$ require one or two units
   of time for their execution and there are only
   two processors $P_1$ and $P_2$.


This result is tantamount to showing that the
problem of optimal task scheduling according to
this model is computationally intractable.
Moreover, worst-case investigations of the known
polynomially bounded scheduling algorithms have
shown that the length of the generated schedules
can come arbitrarily close to (2 - 1/m)
times the length of the optimal schedule. This
value, however, is an upper bound on every
demand schedule, which allocates an idle
processor to any one of the tasks that are ready
for processing. Hence, it does not appear
reasonable to apply sophisticated algorithms unless the
scheduling problem conforms to a simpler model
which, beyond being confined to unit time tasks,
meets some additional restrictive conditions.
These restrictions may be imposed on the number 
of processors or on the task dependency structure 


(TDS). In fact, it has been demonstrated that 


optimal schedules can be generated by poly- 
nomially bounded algorithms for the case that the 
number of processors is confined to two, while 
the TDS, according to Graham's model remains 
unrestricted [2], and for the case that the TDS 
constitutes a tree structure while the number of 


processors remains unbounded [3]. 


Tree structures, however, do not reflect the type 
of tasks dependencies that can typically be found 
in conventional computer programs. Recently, a 
more adequate class of restricted TDS, the so- 
called series-parallel graphs, have been investi- 
gated by Goyal [6]. However, the upper bound on 
the length of a level schedule for these graphs 
is (2 - 5) times the length of an optimal 
schedule. With increasing m, this value comes
arbitrarily close to the worst
possible bound (2 - 1/m).


These two examples - tree structures on the one 
hand and series-parallel structures on the other 
hand - indicate that two objectives have to be 
met when imposing restrictions on the TDS with 
regard to the generation of task schedules by 
polynomially bounded algorithms: 

1.) The algorithm must produce at least sub-
optimal schedules, i.e. the upper bound on
the length of the schedule must be decidedly 
lower than the worst case upper bound to 
justify the increased computational effort 
as compared to arbitrary demand scheduling. 

2.) The class of TDS that conforms to the first 
objective must reflect structures that appear 


in real processes. 


The purpose of this paper is to present TDS's
which meet these objectives with regard to level
schedules. In the next section, we introduce
rather informally the class of the so-called
N-free TDS's, and in the third section, we
demonstrate that the lengths of level schedules for
these TDS's are bounded by 3/2 times the optimal
length, which represents a considerable
improvement over the worst case bound. In section
4, we show that N-free TDS's cover the structures
of nested DO-loops, the Fork-Join concept, and
also comply with the rules of structured
programming.


II. N-free Task Dependency Structures 
When trying to define a class of TDS's that 


yields at least suboptimal level schedules, it 
is, as a first approach, useful to identify 
those TDS-elements which may cause level 


schedules to become non-optimal. 


Consider, as an example, the TDS in figure 1, 
which belongs to the class of the so-called 


multi-linear structures. 
Figure 1: i 


A level schedule for such a simple structure is 


always optimal, since every task that has been 
completed frees its successor task from the next 
lower level for execution. Hence, the tasks can 
be processed level by level without violating 
any dependency relations. This situation changes, 
if tasks with more than one successor and more 


than one predecessor are permitted. 


Definition 1: A task t in a TDS G is called


a branch task, if the number of its immediate 
successors is greater than one; correspond- 
ingly, a task t is called a merge task, if 
the number of its immediate predecessors is 


greater than one. 


Now, it is conceivable to have a task dependency
structure as shown in figure 2, where the tasks
1,2,3,4 belong to some level $\ell$, and the tasks
5,6,7,8,9 belong to the level $\ell - 1$.


Figure 2:


We assume, that the number of processors is three. 


Clearly, if during the first time slot the tasks 
1,2,3 are being performed, then only the tasks 4 
and 5 can be executed next, leaving one processor 
idle. It takes another two time slots to process 
the remaining tasks 6,7,8,9; i.e. four time slots
altogether. If, however, task 4 is executed 
during the first time slot, then all processors 
can be kept busy and the entire job is done in 


only three time slots. 


Another example of a TDS-element which may be
responsible for erroneous scheduling is shown
in fig. 3a.

Figure 3:


Suppose the tasks 1,2,3,4 belong to the level $\ell$,
and the tasks 5,6,7,8 belong to the level $\ell - 1$.
If the number of processors is three, then it is
imperative that the task 1 be processed during
the first time slot while two more tasks from
the level $\ell$ can be arbitrarily selected.
Otherwise, all tasks of the level $\ell - 1$ would remain
blocked for one more time slot, during which only
the task 1 could be executed. The entire job
would require four time slots, compared with only
three time slots for an optimal schedule. Task 1
loses its blocking effect with respect to level
scheduling, if its level is lifted above the
level of the tasks 2,3,4, in which case the task
1 can never be scheduled later than the tasks
2,3,4. This may be caused either by a higher
level of task 5 relative to the tasks 6,7,8, or
by an additional task along at least one of the
edges leading from task 1 to the tasks 6,7,8.
Thus, we have: 


Definition 2: A task t with level $\ell$ in a TDS
G is called a blocking branch task, if at
least one of its immediate successors with
level $\ell - 1$ is a merge task.
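Definitions 1 and 2 can be made concrete with a small sketch. Several points are assumptions of this illustration rather than statements from the text: a TDS is given as a dictionary mapping each task to its immediate successors, the level of a task is taken to be the number of tasks on the longest path from it to a terminal task, and a blocking branch task is required to be a branch task in the sense of Definition 1. The function names are hypothetical.

```python
def task_levels(succ):
    """Level = number of tasks on the longest path to a terminal task (assumed)."""
    memo = {}
    def level(t):
        if t not in memo:
            memo[t] = 1 + max((level(s) for s in succ[t]), default=0)
        return memo[t]
    return {t: level(t) for t in succ}

def blocking_branch_tasks(succ):
    """Branch tasks having a merge task one level below among their successors."""
    lev = task_levels(succ)
    n_pred = {t: 0 for t in succ}
    for t in succ:
        for s in succ[t]:
            n_pred[s] += 1
    return sorted(
        t for t in succ
        if len(succ[t]) > 1                                   # branch task
        and any(n_pred[s] > 1 and lev[s] == lev[t] - 1        # merge task below
                for s in succ[t])
    )

# Four tasks with edges 1->3, 1->4, 2->4 (the shape of an N).
n_structure = {1: [3, 4], 2: [4], 3: [], 4: []}
# A fork followed by a join: 1->2, 1->3, 2->4, 3->4.
fork_join = {1: [2, 3], 2: [4], 3: [4], 4: []}
print(blocking_branch_tasks(n_structure), blocking_branch_tasks(fork_join))
```

In the fork-join graph the merge task 4 is not an immediate successor of the branch task 1, so no blocking branch task is reported.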


The simplest structure that fulfils definition 2 
is composed of only four tasks, and, as shown 

in fig. 3b, has the shape of an N. These N- 
structures constitute the most critical TDS-ele- 
ments as far as level scheduling is concerned, 
since errors can be made with the least number 
of tasks involved. A large number of N-structures 
within the TDS may therefore cause a level 


schedule to rapidly approach the worst case bound
(2 - 1/m).


Hence, it appears rather rewarding to investigate 
the scheduling problem for TDS's which are free
of these N-structures. It is reasonable to 

assume that, on the one hand, these - only 


slightly restricted - TDS's yield at least sub- 


optimal level schedules and that, on the other 


hand, computer programs can be kept free of 


blocking branch tasks. 


III. Worst case bounds for level schedules on
N-free task dependency structures


In this section, we state and prove the basic
theorems regarding the length of level schedules
on N-free TDS's.


Definition 3: A TDS G is called N-free, if it
has no blocking branch task.


Theorem 1: Let G be an N-free TDS, let $w_L$ be
the length of an arbitrary level schedule,
and let $w_{opt}$ be the length of an optimal schedule
for G. Then, for m > 2 processors it holds:

$$\frac{w_L}{w_{opt}} \le \frac{3}{2}$$

and this is the best possible bound.


Theorem 2: If the same assumptions apply as in 


theorem 1, then for m= 2 processors every 


level schedule for G is optimal. 


These two theorems may be proved by means of the 
following definitions and lemmata. Some of the 
definitions can be found in [5]; most of the
proofs for the lemmata are omitted since they 


are of a more or less technical nature. 


Definition 4: Let S be a schedule for a TDS G.
Let w(S) denote the length of S given by
the number of time slots required to execute
G according to schedule S. S(i) denotes
the subset of all tasks $t \in T$ that are
executed during time slot i. If |S(i)| is
smaller than the number m of the given
processors, then time slot i in schedule S
is called an 'Incompletely Occupied Time Slot'
(abbr. IOTS).


Lemma 1: Let i be an IOTS in a schedule S for
G. Then the set N(S(i)) of all successors
of the tasks belonging to S(i) is equal to
the set of all those tasks that have not yet
been executed at the end of time slot i in S.


It should be emphasized that this lemma, of
course, is correct only since we are dealing
with demand schedules.


Definition 5: An IOTS i is real if there
exists an optimal schedule R for G so
that S(i) = R(j) and j < i; otherwise, i
is called virtual.


Lemma 2: Let i be a virtual IOTS in a non-optimal
schedule S for G. Then, there exists at
least one task $t \in S(i)$ which in every
optimal schedule R for G is executed in a
preceding time slot.


Definition 6: A virtual IOTS i in a non-optimal
schedule S is called primary, if there exists
a task $r \in S(i)$ and a task $t \in S(\ell)$, ($\ell < i$),
so that for an arbitrary optimal schedule R
for G holds: $r \in R(j)$ and $t \in R(k)$ and
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Lemma 3: Every non-optimal schedule S has at 


least one primary IOTS. 


Definition 7: A task r is said to dominate a
task s in a TDS G, if the set of successors
of s, N(s), is a subset of the corresponding
set N(r) of r.


The dominance criterion requires that a task is
always executed no later than those tasks it
dominates.


Lemma 4: Let S be a schedule for a TDS G so
that the dominance criterion is not violated
in S. Then there exists no primary IOTS i
in S with |S(i)| = 1.

Proof: Assume there exists a primary IOTS with
the above property. Let i := min {j | j is a
primary IOTS in S and |S(j)| = 1} and let
S(i) = {t}.


According to lemma 1, all tasks that are executed
in S after task t belong to the set of
successors N(t) of t. From the definition of a
primary IOTS follows that there must exist at
least one more task s, which is independent of
t and which can be executed in time slot i
without involving an IOTS; otherwise, i would
be a real IOTS. Therefore, the following relation
holds for these two tasks: $N(t) \supseteq N(s)$ and t
is executed later than s in schedule S. This
is a violation of the dominance criterion and,
hence, a contradiction to the assumption.


Lemma 5: Let G be an N-free TDS, let S be an 
arbitrary level schedule for G. Then the 


dominance criterion is not violated in S. 


Proof: Assume the dominance criterion is violated
in S. Then there exist two tasks s and t so
that $N(t) \supseteq N(s)$, and s is executed before t,
i.e. $\ell(s) \ge \ell(t)$. Then, the set of immediate
successors of s, DN(s), must be a pure subset
of the corresponding set DN(t). Since DN(s)
cannot be empty, there exists a task r with
$r \in (DN(t) \cap DN(s))$. Hence, task t is a
blocking branch task. This is a contradiction to
the assumption that G is N-free.


Figure 4:


(Schedule S and Schedule R, drawn as processors versus
time slots, with the neighborhood U(i,j) marked.)


Lemmata 4 and 5 immediately lead to the following


Lemma 6: In an arbitrary level schedule S for
an N-free TDS G there exists no primary IOTS
i with |S(i)| = 1.


Obviously, in every IOTS of a schedule S for
two processors only one task is executed.
Considering the foregoing lemma 6, it follows that in
every level schedule for an N-free TDS G and
two processors there exists no primary IOTS.
According to lemma 3, in every non-optimal
schedule, however, there exists at least one
primary IOTS. This proves the correctness of
theorem 2.


Comment: From the definition of a primary IOTS
follows, that i' < i and j' > j.
Lemma 7: Suppose, in a schedule S for aTDS G 


all the time slots i, it1,...,j, (i<j), are The rest of the proof of theorem 1 can now be 
IOTS's and at least one of these IOTS's is given straightforwardly: 

. : : : J 
primary. Then all IOTS's i,...,j are primary. Ree he iw s(2)|. Then the following equat- 


Q=1 
Definition 8: Let the time slots i,...,j,(isj), fen NOMS Por tae. Sotimal-scnedule: “hs 
be consecutive primary IOTS's in a schedule S 2) oe 
: di cas R(2)/= o + with > O. The int 
for N-free TDS G. The neighborhood U(i,j3) ioat | | P P | Sn ere 


of these IOTS's is defined as follows: 


k and h are defined as follows: k := j-iti 
e ° roan ° ° : 4 
U(i,j) := {ul J its 2'< j" S(u) mR(2*) # OI edd ce At tee 
where the time slots i' and j' are defined as 7 
follows: Let R be an optimal schedule for G then Then the number of time slots that 1s necessary 
; : J | to execute the o+p tasks in schedule R is 
i' := min {R(u') mn |) s(2) # ¢} a4 
ui ; . PL 
Q=i . | given by Moot = kth. To execute the same number 
, J of tasks in the level schedule S wo Ps + py (a) 
j' := max {R(u') 9 |) s(z) # o} > Wy | 
Ls Q=i time slots are necessary, where m is the number 


(b) 


of processors. Since o+tp < (kth)-m, 2k+h- || 
O+p . 
QR 


ee + 
minimal and, subsequently, we 


if in every primary IOTS between i and j only 


For a better understanding of this definition and 


1s an upper bound for w, °. Now, o becomes 


the following step in the proof see fig. }. 
. becomes a maximum 


the minimal number of tasks is executed. Accord- 
ing to lemma 6, for every primary IOTS, this 
minimal number of executable tasks is two and, 


therefore, o = 2k. Then, from lemma 2 follows 


(a) pp ie See 
fe oe the smallest integer greater that k=h and, therefore, 
aan + 


(b) LE] denotes the greatest integer smaller 


than a 
m 
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me 3A 
Consequentl re are he ees 
q > gio 2D wT Dp TE mek, 
opt ee o+p 
then ESE QO and therefore < 2 
m otp —2° 
W 
opt 


Since the above estimate holds for an arbitrary
sequence of primary IOTS's (which includes the
case of only one IOTS, since it was defined in
definition 8 that i ≤ j), it holds for every
sequence of primary IOTS's in a level schedule
S for an N-free TDS G. Hence, it is also valid
for the sum of all such sequences in S. If there
exist some tasks in S which are executed in
time slots that do not belong to any neighborhood
U(i,j) in S, all tasks executed in these time slots
according to schedule S are executed in
schedule R also in one time slot. That is, in order
to extend the above estimate to the lengths of
the whole schedules S and R, it is necessary
to add a constant c > 0 to the numerator and
the denominator of the fraction 3/2. Since

$$\frac{x+c}{y+c} \le \frac{x}{y} \quad \text{for } x \ge y > 0,\ c > 0,$$

it follows:

$$\frac{w_L(G)}{w_{opt}(G)} \le \frac{3}{2}$$

and this completes the proof of theorem 1.


To verify that this upper bound is the best
possible, figure 5 shows a model of an N-free TDS
G for which $w_L(G) / w_{opt}(G)$ can be approximated
arbitrarily close to 3/2.


Figure 5:


Assume k = n·m, i.e. k is a multiple of the
number of processors. Then it is easy to verify
that $w_{opt}(G) = 2nm$ and $w_L(G) = 2n(m-1) + nm$.
Hence,

$$\frac{w_L(G)}{w_{opt}(G)} = \frac{3nm - 2n}{2nm} = \frac{3}{2} - \frac{1}{m} \to \frac{3}{2} \quad \text{for } m \to \infty .$$
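The limiting behavior of this ratio can be tabulated directly from the two schedule lengths quoted above. The sketch below is only an arithmetic check; the function name `level_to_optimal_ratio` is hypothetical, and the expressions for the two lengths are taken from the text rather than derived from the graph of figure 5.

```python
from fractions import Fraction

def level_to_optimal_ratio(n, m):
    """Ratio w_L(G) / w_opt(G) for the figure-5 model with k = n*m."""
    w_opt = 2 * n * m
    w_level = 2 * n * (m - 1) + n * m
    return Fraction(w_level, w_opt)

for m in (3, 10, 100, 1000):
    r = level_to_optimal_ratio(1, m)
    print(m, r, float(r))
```

The ratio does not depend on n and stays strictly below 3/2 for every finite m, which is consistent with 3/2 being a bound that is approached but not attained by this family.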


The following diagram exhibits the correlations 
between the results we have obtained from our 
investigation of level schedules on N-free TDS 
and, particularly, from the notions of dominance 


and IOTS's. 


Figure 6: (diagram relating the following statements)

  Level schedule for an N-free TDS G
  Dominance criterion is not violated
  There is no primary IOTS during which only one task
  is executed
  Schedule for 2 processors is optimal


IV. Relevance of N-free structures 
to computer programs 


In the preceding section it has been shown that 
N-free TDS's are suited for level scheduling. In 
the following, we offer some representative 
examples which exhibit the relevance of these 
structures to computer programs. The TDS of a 
program can be considered as the representation 
of what we call the 'minimal control structure' 
of this program. This is to say that the TDS 
imposes only those precedence relations between 
tasks that have to be observed in order to 
accomplish the correct execution of the program. 


Consider, for instance, figure 7. 


Figure 7: b) 


The graph in figure 7a represents data dependen- 
cies between some tasks in a program. This TDS 
permits several sequential control structures 

for the program execution. In figure 7b we list 
two of them. All correct control structures for 
the execution of this program have in common 
that, for instance, task 4 is executed later than 
task 2 and task 3, or that task 6 is executed
last.


For multiprocessor scheduling purposes, programs 
(or program segments) that feature a high degree 
of parallelism are of considerable interest. This 
is particularly true for iterative (nested) DO-
loop structures [8]. In a previous paper we have
shown that these structures, which constitute a
subset of N-free structures, even yield optimal
level schedules under certain conditions [5].


FORK-JOIN constructs are known as programming
tools which allow the explicit specification of
parallel executable task sequences [10]. Every
FORK instruction can be considered as the
realization of a branch task. Moreover, the FORK-
JOIN concept demands that every FORK instruction
is associated with a junction point. At this
junction point - represented in the TDS by a
merge task - all processes that have been initia-
ted by the associated FORK instructions must
recombine. This, however, implies that the FORK-
JOIN concept does not permit N-structures.


Improving program verification and program
manageability are two main objectives of what
is commonly known as structured programming [9].
Particularly in very large and complex programs
these properties are quite desirable. The general
idea of structured programming is to permit only
some elementary control structures within
programs. These structures include:


1) a single assignment statement, 

2) a conditional branch statement, 

3) an iteration, i.e. the repeated execution of
an elementary control structure as long as a 
certain condition holds, 


4) a sequence of elementary control structures. 


Obviously, the control structure of a structured 
program is N-free, if each of these elementary 
control structures is N-free. Since we are 
dealing only with deterministic scheduling, the 
data flow in a program is assumed to be known
a priori. For this reason, a conditional branch
can be omitted in our considerations. Assignment 
statements or sequences of them are represented 
in the corresponding TDS as a single task. Hence, 
only the iteration remains to be considered. If 
the repeated execution of elementary control 
structures within an iteration can be done 
independently of each other, it is equivalent to 
a DO-loop structure, the representation of which
has already been treated. If, however, the single
steps of an iteration must be executed in a
sequential order, its control structure is similar
to the sequence of elementary control structures.


Every algorithm can be written using only simple
instructions, conditional branches and iterations.
Consequently, every program, which is nothing
but the description of an algorithm, can be
written in a well structured form.


Concluding remarks 


Of all the algorithms, by which the priority of 
tasks for scheduling purposes in multiprocessor 
systems can be determined, the level algorithm 
appears to be most suitable for practical appli- 
cations because of its simplicity. So far, it has 
been demonstrated that the level algorithm 
suffices to generate optimal schedules for TDS's 


that feature a tree structure. 


In this paper, a new type of TDS has been present-
ed which represents an extension of the tree
structures and which reflects to a greater extent
structures that appear in real processes. It has
been shown that level schedules for this class of
restricted TDS's provide considerably better
results than for general TDS's.
A FIXED-VARIABLE SCHEDULING MODEL FOR MULTIPROCESSORS
Department of Mathematics
Lubbock, Texas


Abstract - A model is presented for a
scheduler which interfaces multiprocessing hard- 
ware and software to form a complete model of the 
dynamic effects of scheduling on a multiprocessor 
system. A fixed-variable scheduling philosophy 
is introduced which allows both the immediate 
scheduling of tasks in a high-demand situation 
and the careful improvement of schedules through 
arbitrarily complex algorithms as time becomes 
available to the scheduler. Hardware and soft- 
ware are determined for the creation of a direct 
mechanism for the implementation of this new 
scheduling philosophy, although only the fixed 
part is actually implemented. Experiments on the 
model enable several conclusions to be made re- 
garding the acyclic representation of cyclic 
graphs and the marginal improvements achievable 
by some global scheduling heuristics. 


I. The Scheduling Philosophy 


The problem of maximizing the efficient use 
of the computer's resources is hampered by the 
large number of system variables to consider. 
Even in the restricted case where task lengths 
and precedences are the only considerations, 
looking for optimal solutions becomes futile [19]. 
In this paper we accept the bounds on optimality 
presented by heuristic methods [7] and consider 
appropriate heuristics which might include some 
of the other parameters of importance. Some 
theoretical results [4] demonstrate the levels of 
complexity which are added by the consideration 
of the computer system as a general set of limit- 
ed resources, while others [11] specifically 
approach the problem of including memory restrictions
in the scheduling algorithm. This rapid
growth seen in analytical complexity with each 
new system parameter suggests simulation as the 
most immediately viable tool for studying the 
dynamic effects of scheduling in realistic system 
models. 


What we propose is a new philosophy of 
"fixed-variable" scheduling. The theory is based 
on the fact that there are certain duties the 
scheduler must perform as rapidly and frequently 
as possible in order to avoid undue delays for 
scheduling overhead. At the same time there are 
many interesting parameters we would like to in- 
clude in the scheduling algorithm if it were not 
for the time penalty. The solution comes from 
the fact that the demands made upon the scheduler 
are not constant; both the frequency of requests 
for tasks by the processors and the time penalties
for polling new tasks and implementing the
scheduling algorithm will vary widely during the 
hours of operation on the multiprocessing system. 


The "fixed" part of the scheduling philosophy
is the implementation of a minimal scheduler in 
the following fashion. Static heuristics are 
used to pre-order the tasks such that delivery of 
tasks to a processor simplifies to a hardware 
function [8] with the only real overhead for the 
scheduler being the time required for polling the 
user programs for tasks ready to be scheduled. 

A high rate of polling (the asymptote being demand 
scheduling) requires too much time on the part of 
the scheduler to search for "ready" tasks. The 
alternative is a low polling rate which may cause 
the "ready list" to become empty or a high-prior- 
ity task to miss an early chance at execution. 
This trade-off creates a decision in the polling 
frequency which is best made dynamically. 


The "variable" part of the scheduling philo- 
sophy incorporates all of the various dynamic 
decisions which may be made. With the hardware 
producing a "reasonable" schedule of tasks at a 
frequency comparable to the highest rate of pro- 
cessor demand, the scheduling software need only 
"improve" the already existing schedule as time 
permits. That is, when the request rate for 
tasks is low, the scheduler executes a compli- 
cated algorithm involving a large number of 
dynamic variables, and when the request rate is 
high, the scheduler spends nearly all of its time 
polling in order to keep the "ready list" from 
becoming empty. Other decisions, such as the 
polling frequency, which and how many dynamic 
variables to consider, and the desired level of 
optimality in the algorithm which uses these 
variables, are made dynamically by the scheduler. 
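The division of labour between the two parts can be pictured with a small sketch. It is not the mechanism of the model (the paper leaves the variable part open); the class name, the single `busy_threshold` rule for deciding when demand is "high", and the `improve` callback are all assumptions introduced only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FixedVariableScheduler:
    # Hypothetical cut-off: request rates at or above this count as "high".
    busy_threshold: float
    # (static_priority, task) pairs, highest priority first.
    ready_list: list = field(default_factory=list)

    def poll(self, newly_ready):
        """Fixed part: merge newly ready tasks into the pre-ordered list."""
        self.ready_list.extend(newly_ready)
        self.ready_list.sort(key=lambda pair: pair[0], reverse=True)

    def deliver(self):
        """Fixed part: hand the head of the ready list to a requesting processor."""
        return self.ready_list.pop(0)[1] if self.ready_list else None

    def cycle(self, newly_ready, request_rate, improve):
        """One scheduler cycle: always poll, refine only when demand is low."""
        self.poll(newly_ready)
        if request_rate < self.busy_threshold:
            # Variable part: any (possibly expensive) re-ordering algorithm.
            self.ready_list = improve(self.ready_list)
            return "improved"
        return "polled only"
```

Under high demand the cycle does nothing beyond polling, so the processors are still served from the statically ordered list; the `improve` step runs only when time is available.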


The difficulty with this scheduling philoso- 
phy is in determining the answers to the dynamic 
questions just raised. Since it is difficult to 
measure the effects of the algorithms in this 
highly variable setting, and since the overhead 
penalties are heavily machine and program depend- 
ent, it becomes impossible to determine the 
answers to the dynamic decisions the scheduler 
must solve. In particular, this aspect of the 
scheduling model could not be included in the 
simulation experiments to follow, due to the lack 
of the appropriate overhead factors and the un- 
bounded possibilities for algorithms to test. 

The remainder of this paper investigates the 
appropriate parameters for the fixed portion of 
the scheduling model and leaves the variable 
portion of the model open to further investiga- 
tion. 


II. The System Model 
Assumptions Regarding the Host System 


The features which we consider as part of 
our model begin with a collection of independent 
processing units organized on a time-phased ring. 
All execution is handled by functional units 
shared by the processors, with the scheduling 
duties being performed by a dedicated processor 
similar to the one offered as part of Texas 
Instruments' ASC [18]. The processing units re-
quest tasks from the scheduler upon completion of
their previous assignments (in concordance with 
the results of [5]) while the operating system 
polls user programs to find tasks which are ready 
for scheduling. When the polling routine detects 
that all of the predecessors of a given task have 
been processed, the task is placed on a "ready" 
list where it waits for assignment to a processor 
by the scheduler. The scheduling processor also 
assigns tasks for interrupt processing as de- 
scribed in [6]. 


The aspects of memory which influence the 
scheduling model are memory interference and the 
availability of memory for tasks about to be 
executed. A variety of interference models is 
available [2] with the simplest being to assume 
that memory is shared randomly. Less interfer- 
ence may be achieved by the "home memory" concept 
[17] in which instructions and data used primar- 
ily by one processor are loaded in a memory 
module specifically assigned to that processor. 
However this requires that the processor assign- 
ment be known in advance for proper loading of 
the task, and further complicates the treatment 
of interference among the various resources in 
the system. The availability of memory for load- 
ing tasks also appears in [3] as a part of the 
scheduling algorithm. 


The "forking scheme" which determines the 
various parallel execution paths among the tasks 
is a function of the model used to describe the 
programs which are to be executed on the multi-
processor. The responsibilities of the proces-
sors, the processes, and the operating system for
achieving the proper matching between tasks and
processors are taken up in [5]. The
"ready list" of tasks about to be scheduled is
managed by the scheduling processor and is con- 
stantly being updated. A special controller 
serves as the interrupt mechanism and delivers 
the next task from the ready list upon processor 
request. If hardware monitoring is desired to 
improve the scheduling efficiency, that also is 
interpreted by the scheduling processor. 


Run-time anomalies, or variance in task
attributes, are modeled in several stages. Vari-
ance in task lengths due to data dependencies,
looping instructions, or IO waits within a task
is determined by treating the task length as a
random variable with only its mean and some
assumed distribution being known to the scheduler.
Variance due to contention for hardware resources
is modeled analytically [10] to produce a "con-
tention function" relating the "free" execution
time of a task to its "limited-resource" execu-
tion time. Variation which occurs in the
interpretation of the program graph, such as
branches and loops between vertices, has been
studied by Martin and Estrin [14].


Representation of Tasks 


The tasks to be scheduled are considered to 
be vertices of a single program graph of the type 
discussed in [9]. The model for an individual 
task consists of its connecting structure in the 
program graph and its resource attributes. Since 
many of the operating system tasks may be consid-
ered as service calls from the user program, they
may be represented as subgraph modules of the
user program, thus enabling the model to treat
user and operating system tasks as indistinguish-
able. In addition, it is anticipated that future
research will enable the entire operating system 
to be modeled by a single program graph with sub- 
graph substitutions being performed dynamically 
to represent the flow of user jobs in the system. 
This makes the assumption in this paper that only 
one program graph exists in the system at a time 
a reasonable simplification. 


The origin of the program graph and descrip-
tions of the individual tasks are not a major
concern here. We assume that all user programs
require an amount of computation time warranting
their decomposition into several (parallel) tasks,
that they are properly compiled, checked for 
proper termination conditions, and linked through 
appropriate graph-module replacement algorithms 
into the operating system's program graph. Each
task possesses a unique identity, a link to its
creator (operating system or user job), its memory
size and protection requirements, and estimates
for its duration, variance of duration, and
instruction mix. The instruction mix reflects
the resource requirements in that it differenti-
ates tasks with heavy usages of IO, arithmetic,
or memory capacity. The variance parameter allows
the task to behave anomalously during execution
after the scheduler has assumed the other esti-
mated parameters to be exact.


Several graph structures for programs to be 
input to the system were taken from [13] (e.g.
Figure 1a). The scale of the tasks and distribu-
tion of task weights were modified, however, to
resemble a more balanced weighting of tasks that 
might result from a compiler analysis of user 
programs [8]. The desirability of having the 
compiler balance the task weights in this fashion 
becomes apparent when comparing expected results 
from heuristics in [1] with the worst-case bounds 
on scheduling anomalies expressed in [7] and [19], 
which are attained by considering asymptotic task 
relationships. Transformations were also made on 
the original graphs to make them adhere to proper 
termination criteria [9] and to make the graphs 
acyclic, replacing the cycles by multiplying the 
vertex weights within the loops by the estimated 
loop frequencies [14]. 
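The cycle-removal step can be illustrated as follows. This is a minimal sketch of the weighting rule only (multiply the weight of every vertex inside a loop by the estimated loop frequency); the function name, the data layout, and the sample numbers are assumptions, and the removal of the back edges themselves is not shown.

```python
def collapse_loops(weights, loops):
    """Return vertex weights for the acyclic version of a program graph.

    weights: dict mapping vertex -> estimated time of one execution
    loops:   list of (vertices_in_loop, estimated_loop_frequency)
    """
    acyclic = dict(weights)          # leave the caller's table untouched
    for body, frequency in loops:
        for vertex in body:
            acyclic[vertex] *= frequency
    return acyclic


# Hypothetical graph: vertices b and c form a loop executed about 4 times.
weights = {"a": 2, "b": 3, "c": 5}
print(collapse_loops(weights, [({"b", "c"}, 4)]))
```

A vertex lying in nested loops is multiplied once per enclosing loop, which is appropriate if each inner frequency is given per iteration of the enclosing loop.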


Our purpose is to simulate both the cyclic
and acyclic versions of the graphs in order to
determine the relative advantages of each model-
ing technique. The final transformation perform-
ed on the graphs made use of graph modules and
replication vertices [9] to implement the cycles
in the graphs directly. Returning to the origin-
al cyclic graphs, each loop was replaced by an
appropriate graph module controlled by one of the
replication vertex forms provided in the extended
program model. The result may be seen by compar-
ing Figures 1a and b. Significant savings are
accomplished in both the number of tasks to be 
specified by tables and in the simplicity of the 
overall graph. Although simulation proved to be 
considerably longer in terms of both simulated 
and real execution times in the modularized case, 
due to the repeated simulation of each loop for 
each replication, control structures of this kind 
must be investigated for the actual implementation 
of real programs in real multiprocessing systems. 


Modeling the Invoking of Algorithms 


Several scheduling algorithms have been sug- 
gested for use in multiprocessing systems. In 
this paper we consider a number of heuristics 
based on task lengths and program structure, 
leaving user job priorities for later considera- 
tion. Two approaches were taken in [7], one 
based on longest-task-first, the other on a 
lexicographic ordering from a labeling process 
giving unique assignments beginning from the exit 
vertex of the graph. Chen and Epley [3] used the 
latest allowable starting times, while Martin and 
Estrin [14] used a variety of heuristics, includ- 
ing first-in-first-out, shortest task first, 
longest task first, longest path to exit, expect- 
ed path length to exit, and number of successors. 
Adam [1] compares some of these with similar path 
calculations measured from the entry vertex. 


Additional algorithms appear with the 
inclusion of memory requirements as a recognized 
scheduling parameter. The work reported in [3] 
considers both channel speed and memory size in 
finding schedules, while [11] introduces a "two- 
dimensional" scheduling strategy which uses an 
optimistic initial guess for the schedule comple- 
tion time to initiate an iterative trial method 
for finding optimal schedules satisfying the 
memory requirement. Analysis of these and other 
memory-oriented task-scheduling algorithms is 
available in [12]. The limitation on these 
algorithms is that they are intended for multi- 
programming systems where there are no inter-job 
dependencies. We wish to study multiprocessing 
systems where partial orderings exist between 
tasks due to the structure of the program graph 
of which all tasks are a part. 


The heuristics chosen for study may be 
divided into three groups: local, global, and 
comparison standards. The local algorithms 
(those in which priorities are calculated from 
the parameters of the task alone) are most imme- 
diate successors first (MISF), longest time first 
(LTF), shortest time first (STF), and largest 
memory first (LMF). The global algorithms (those
in which priority calculations require knowledge 
of the graph structure and the priorities of other 
vertices) are most total successors first (MSF), 
highest level first (HLF), highest weighted level 
first (HWLF), longest expected path length first 
(LPF), and longest weighted path length first 
(LWPF). The standards of comparison are first-in- 
first-out (FIFO) and random selection (RAND). The
reason for having two comparison standards is that
FIFO is the simplest to implement (no sorting at
all), while purposely randomizing the list elimi- 
nates any bias that may have been entered by the 
polling algorithm. 


The amount of difficulty to compute each of 
these heuristics is determined primarily by the 
same group classification. The comparison 
standards are essentially no work at all, while 
the local algorithms simply select one of the 
task attributes and assign it as a priority. The 
global heuristics, however, require more computa- 
tion. The "level" of a vertex is defined as the 
number of vertices in the longest path from that 
vertex to the exit vertex (inclusive), while the 
"expected path length" is defined as the combina- 
torial average of the probabilities of each 
possible path to the exit, based on the output 
probabilities associated with each vertex. (The 
actual calculation of these heuristics can be done 
by a polynomial process [15].) For the "weighted" 
algorithms, the number of vertices in each path is 
replaced by the sum of the estimated times for 
each vertex in that path. 
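The level and weighted-level definitions can be sketched directly as a recursion over successors. The graph, the time estimates, and the function names below are hypothetical; the expected-path-length heuristics are not covered by this sketch.

```python
from functools import lru_cache


def make_level_functions(successors, est_time):
    """Build memoized level and weighted-level functions for an acyclic graph."""

    @lru_cache(maxsize=None)
    def level(v):
        # Number of vertices on the longest path from v to the exit, inclusive.
        return 1 + max((level(s) for s in successors[v]), default=0)

    @lru_cache(maxsize=None)
    def weighted_level(v):
        # Same path measure, with each vertex counted by its estimated time.
        return est_time[v] + max((weighted_level(s) for s in successors[v]), default=0)

    return level, weighted_level


# Hypothetical five-task graph with the single exit vertex "e".
successors = {"a": ("b", "c"), "b": ("d",), "c": ("e",), "d": ("e",), "e": ()}
est_time = {"a": 2, "b": 1, "c": 7, "d": 1, "e": 3}
level, weighted_level = make_level_functions(successors, est_time)

for v in successors:
    print(v, level(v), weighted_level(v))
```

In this example HLF would rank b above c (level 3 against 2), while HWLF would rank c above b (weighted level 10 against 5), which shows how the two global heuristics can disagree when task weights are unbalanced.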


The algorithms we have discussed may be in- 
voked either statically or dynamically. The 
static or list-schedule technique is to order all 
tasks prior to execution according to whatever 
heuristic is chosen, and to assign the list mem- 
bers to processing units on a first-come-first- 
served basis. More complicated schemes may do 
the assignment in advance as well by creating a 
separate list of tasks for each processing unit. 
The dynamic approach differs in that the task 
priorities may be modified at run-time as more 
information becomes available about each task and
the status of the program graph. This is
particularly true of global heuristics, since 
they are based upon graph structures, where 
already-executed portions of the graph may be 
removed and the remaining task priorities recal- 
culated. While the dynamic method produces better 
schedules in general, due to increased knowledge, 
it suffers from the problem that an inordinate 
amount of time may be spent recalculating the 
task priorities. 


The basic strategies chosen for testing in 
this paper are invoked in a mixed fashion, in that 
the scheduling priorities are pre-calculated 
(thus static) while the tasks are assigned to 
processors dynamically upon processor request. 

The "ready" queue is maintained by the operating 
system polling user jobs for tasks available to 

be scheduled. Tasks are selected from the queue 
during program execution, based upon the immedi- 
ate resource availability and the priority 


assigned to each task by the heuristic. 
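The mixed strategy (priorities fixed in advance, assignment made on processor request) can be sketched as a small event loop. This is a simplified illustration, not the simulator used in the experiments: it assumes demand scheduling, exact task times, identical processors, and no resource contention, and the graph and priority tables are hypothetical.

```python
import heapq


def list_schedule(successors, est_time, priority, n_processors):
    """Return (makespan, start_times) for static priorities with dynamic assignment."""
    n_pred = {v: 0 for v in successors}
    for v in successors:
        for s in successors[v]:
            n_pred[s] += 1
    # Ready queue: highest priority first, ties broken by task name.
    ready = [(-priority[v], v) for v in successors if n_pred[v] == 0]
    heapq.heapify(ready)
    running = []                      # heap of (finish_time, task)
    start, clock, idle = {}, 0, n_processors
    while ready or running:
        # Serve every idle processor from the head of the ready queue.
        while ready and idle > 0:
            _, v = heapq.heappop(ready)
            start[v] = clock
            heapq.heappush(running, (clock + est_time[v], v))
            idle -= 1
        # Advance to the next completion and release its successors.
        clock, v = heapq.heappop(running)
        idle += 1
        for s in successors[v]:
            n_pred[s] -= 1
            if n_pred[s] == 0:
                heapq.heappush(ready, (-priority[s], s))
    return clock, start


# Hypothetical graph: a chain p -> q plus two independent short tasks.
successors = {"p": ("q",), "q": (), "x": (), "y": ()}
est_time = {"p": 2, "q": 2, "x": 1, "y": 1}
by_level = {"p": 2, "q": 1, "x": 1, "y": 1}       # HLF-style priorities
by_arrival = {"x": 3, "y": 2, "p": 1, "q": 0}     # FIFO-like order x, y, p, q

print(list_schedule(successors, est_time, by_level, 2)[0])
print(list_schedule(successors, est_time, by_arrival, 2)[0])
```

On two processors the level-based priorities start the chain immediately, while the arrival order delays it behind the short tasks and produces a longer schedule.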


III. The Experiments 


Input Selection and Validation 


A number of experimental machine configura- 
tions and scheduling philosophies may be repre- 
sented by the model we have discussed. In 
addition, the class of job mixes which can be 
modeled for execution on this system is unbounded. 
We wish to describe specific configurations and 
mixes in order to exercise the simulation under 
different scheduling strategies and ascertain the 
effects of these strategies on the overall system. 
To achieve this it becomes necessary to gather 
data from a variety of sources in order to deter- 
mine the hardware configurations, resource 
descriptions, and hardware instruction mixes for 
the machine model, and to select sample program 
graphs, task descriptions, and resource require- 
ments for the programs to be executed on the 
modeled system. 


The simulations were run on a Xerox Sigma 5 
using a FORTRAN and assembly language system to 
implement a portion of the SIMULA 67 class con- 
cept. The simulation program operates on a 
discrete system-state basis wherein state-tables 
are modified as events such as job-entry, task 
completions, scheduling functions, and processor 
requests for tasks occur. The resource competi- 
tion and configuration parameters resemble those 
of [16] in that lists of resources, competitors, 
priority rules, service times of resources, and 
load variance conditions describe the particular 
hardware being simulated by that run. An exterior
event generator is used to create the program 
graphs and task parameters which are to enter the 
simulation as exogenous events. A general mix of 
programs, including highly parallel, intermediate, 
and serial jobs, is generated in the form of 
tables produceable by compilers, and "executed" 
on the simulated hardware model. The simulation 
is then exercised under different scheduling 
strategies in order to ascertain the effects of 
these strategies on the overall system. 


The general hardware configuration studied 
was a 32-processor system [8]. When it was de- 
termined that sufficient program loads could not 
be generated within the limits of the simulation 
host computer, smaller numbers of processors were 
sampled. The hardware timing parameters were
chosen on the basis of currently available hard- 
ware units, with the number and types of resource 
units selected empirically to fit the instruction 
mixes anticipated. In order to determine the 
instruction mix parameters, traces were taken on 
the XDS Sigma 5 from what were considered to be 
typical programs: a double-precision FORTRAN 
subroutine, a series of FORTRAN IO statements, 
the XPL/S compiler, and a sample run of the 
SIMTRAN simulation package. The results of the 
traces are reported as percentages of instruc- 
tions counted in Table 1. Over 600,000 instruc- 
tions were counted for each mix. 
With the variety and volume of input data and
model parameters available, it becomes necessary 
to make some preliminary studies of the data and 
calculations made upon it to determine their 
validity and appropriateness for use in the simu- 
lation model. Since the memory interference 
function must be re-evaluated for each new task 
to be executed by a processor, its behavior, 
particularly relating to convergence speed, was 
studied under a variety of parameter values to 
determine the numerical accuracy of the analytical 
solutions [10]. These calculations were also 
helpful in selecting the numbers of processors and 
memories and cache hit ratio to be included in the 
simulation.


Some interesting observations may be made 
from the duplicate runs on the cyclic and acyclic 
versions of the test graphs. While the cyclic 
representation may be considerably more accurate 
in representing the flow of control in the actual 
program, the penalty in simulation time is severe. 
For a loop that is replicated k times, k times as 
many tasks must be scheduled, resulting in many 
more polling and scheduling delays. In the
example of Figure 1, 21.7 times as many vertices 
were executed with a 22% increase in elapsed time. 
These time increases were reflected to a much 
greater degree in the computation time to perform 
the simulations. It may be concluded from this 
that, depending upon the level of detail desired, 
either method of modeling cycles may have merit. 
It is necessary to be capable of modeling the 
cyclic execution, both for understanding of the 
model and for implementation of real schedulers, 
but the overhead makes it not worth the effort in 
a larger-scale view of the system, particularly if 
the loop count is very large. 


Variance and Load Parameters 


The subject of variance due to run-time 
anomalies is an inescapable part of the multi-
processor system and is included as part of our
model. Variance in the instructions executed 
within a task, variance in executing power due to 
resource contention, variance in decision branches 
between tasks, and variance from exogenous vari- 
ables such as system load characteristics each 
cast their effect on the overall system. Specific 
studies of these effects have been performed 
previously [14] and need not be repeated here, 
but they are still present in the simulation and 
must be accounted for in a statistical sense. 
Results from single runs on any of the experi- 
ments reported here tend to lack credibility 
unless we can be certain that the range of the 
results is clearly within the range of the vari- 
ance involved in the experiment. 


Since processing time constraints on the
simulation make it impossible to repeat all of
the experiments sufficiently to reduce the vari-
ance in the values reported, a series of 10 runs
with different random-number seeds was made on
each of the sample graphs in order to establish
an estimate for the range of credibility of future
experiments. Simulations of one graph executed an
average of 46 vertices in 1023 time units, requir-
ing a total of 1414 units of execution from the
processors, with a variance of 14%. In the next 
graph simulated, 145 vertices were executed in 
time 590, requiring 3611 processor-time units 
with less than 1% variance. Simulations of a 
third graph completed 124 tasks in time 1974 with 
8% variance and required 2376 processor-time units 
with 15% variance. In the final graph (Figure 1) 
204 tasks were completed in time 8494 with 2% 
variance and 33356 processor-time units were re- 
quired with only 5% variance. The results were 
similar or better when the modularized graphs 
were tested in the same manner. In the remainder 
of this paper, a parameter will be considered 
significant only when its effect exceeds the 
variance from these preliminary runs. 


One form of variance not studied by [14] is 
that caused by resource contention. When the 
system is under sufficient load, tasks with 
similar instruction mixes will conflict over
resource requests and cause variance in the task 
completion times. In order to observe this 
phenomenon, a full load is necessary on the system 
resources which cannot be achieved by a single 
graph, and increasing the size of the graph 
creates space problems in the simulation. If we 
limit the size of the system, considering the 
graph under study to possess only a portion of 
the total system, then the model weakens because 
no interaction exists with the jobs assumed to 
hold the remaining system resources. What is 
required is a means of occupying the other pro- 
cessors not immediately concerned with the current 
graph in such a way that they simulate the
contention effects of having other jobs in the
system.


One solution is to introduce "dummy" tasks to 
the system. Whenever a task is to be scheduled 
(found "initiable" by the token machine) a set 
number of "ghost" copies of the task are entered 
in the ready list. This not only causes other 
processors to become occupied with tasks which 
will have an effect on the resource contention 
calculations, but also produces a load on the 
scheduler and may cause the real tasks from the 
graph to wait for "dummies" to complete, much as 
they would have to in a real system containing 
other jobs from a general mix. The "ghost" tasks 
differ from the real ones in that they merely 
"die" upon completion of their processor-computa- 
tion time and do not cause tokens to be moved as 
in the case of "real" tasks. The number of
"ghosts" created for each "real" task is taken as
a system load parameter. 
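The ghost-task mechanism can be sketched in a few lines. The dictionary layout, the function names, and the `move_tokens` callback are assumptions made for the illustration; only the behaviour (a fixed number of copies per real task, and ghosts that die without moving tokens) is taken from the description above.

```python
def enter_with_ghosts(ready_list, task, load):
    """Enter a real task in the ready list together with `load` ghost copies."""
    ready_list.append({**task, "ghost": False})
    for _ in range(load):
        ready_list.append({**task, "ghost": True})


def on_completion(task, move_tokens):
    """Real tasks advance tokens in the graph; ghosts simply die."""
    if not task["ghost"]:
        move_tokens(task["name"])


ready = []
enter_with_ghosts(ready, {"name": "t1", "time": 5}, load=3)

moved = []
for finished in ready:
    on_completion(finished, moved.append)
print(len(ready), moved)
```

Because each ghost carries the same resource attributes as its original, it competes for the same resources and for scheduler attention, but leaves the state of the program graph unchanged.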


The effects of the load parameter on the 
system are shown in Figure 2, where each run 
assumes all tasks to be from the same one of four 
instruction mixes. The idle time for each mix is 
graphed as a function of the number of ghost 
tasks produced, and as would be expected, the 
effects of the different mixes become more dis-
tinct as the load increases. The next experiment
examined the effect of "balancing" the instruc- 
tion mix on the system by randomly assigning a 
different instruction mix to each task. The re-
sult was compared with an average taken from the
runs made with the separate mixes, in order to 
show the advantage of scheduling an equal number 
of tasks from each mix. By balancing the instruc- 
tion mix on the system in this way, the scheduler 
may avoid contention bottlenecks which occur when 
all the processors are executing tasks which re- 
quest the same resources. The difference found, 
however, was determined insufficient to justify 
including the mix parameter as part of the sched- 
uling priority. 


Another factor to consider is the source of 
the idle times that are being measured. We have 
considered a system where the operating system 
polls its users for tasks available for scheduling 
and made the frequency at which the polling occurs 
a simulation parameter. When at least one task 
has completed, a flag is set enabling the polling 
process-object. When the appropriate time inter- 
val expires, the graph is polled, or scanned by 
the scheduler. We may model demand scheduling 
with this model by simply setting the polling 
interval to an extremely small value, such that 
polling will occur precisely whenever the flag is 
set. The difference in idle time between a poll- 
ed and a demand system was found to be signifi- 
cant. Because of this difference, the remaining 
simulations were performed under demand scheduling 
so that all of the idle time reported may be 
directly attributed to the scheduling algorithm 
and/or the configuration under study. 
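The relation between polled and demand scheduling can be made concrete with a small sketch. It assumes, purely for illustration, that polls fall on integer multiples of the polling interval and that time is measured in integer units; the model described above may time its polls differently. The function names are hypothetical.

```python
def next_poll_time(flag_time, interval):
    """First poll at or after the moment the completion flag is set.

    Assumes polls occur at integer multiples of `interval` (an assumption
    of this sketch); uses integer ceiling division to avoid rounding error.
    """
    return -(-flag_time // interval) * interval


def polling_delay(flag_time, interval):
    """Time a newly initiable task waits before the scheduler notices it."""
    return next_poll_time(flag_time, interval) - flag_time


# A task completes at time 7: with interval 5 the graph is scanned at time 10,
# while with interval 1 (approximating demand scheduling) it is scanned at 7.
print(polling_delay(7, 5), polling_delay(7, 1))
```

As the interval shrinks, the delay vanishes and the model reduces to demand scheduling, which is why the remaining idle time can then be attributed to the algorithm and the configuration alone.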


Comparison of Scheduling Algorithms 


The idle times of the various scheduling 
heuristics described in Section II are compared 
in Figure 3 as functions of the number of proces- 
sors in the system. In every case we see that 
the local algorithms (MISF, LMF, LTF and STF) are 
poor in that they show no improvement over the 
comparison standards (FIFO and RAND). The global 
strategies involving some measure of path length 
(HLF, HWLF, LPF and LWPF) show consistently better 
performance, particularly in the 4-6 processor 
region. The pseudo-global urgency MSF gives some 
improvement over the local urgencies but is not 
as good as the path length algorithms. Little 
difference can be seen between the four global 
strategies. 
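
The gap between a path-length priority and a simple comparison standard can be illustrated with a small list-scheduling sketch. It is only an illustration under stated assumptions, not the simulator behind Figure 3: the six-task graph, the unit task times, and the helper names `levels` and `list_schedule` are hypothetical, HLF is assumed to rank a ready task by the length of the longest path from that task to the end of the graph, and scheduling is demand-driven.

```python
import heapq

def levels(succ, dur):
    """Longest path length from each task to an exit, counting the task itself."""
    memo = {}
    def level(task):
        if task not in memo:
            memo[task] = dur[task] + max((level(s) for s in succ.get(task, [])), default=0)
        return memo[task]
    return {task: level(task) for task in dur}

def list_schedule(succ, dur, n_proc, priority):
    """Demand-driven list scheduling; returns (makespan, idle fraction)."""
    n_pred = {task: 0 for task in dur}
    for followers in succ.values():
        for s in followers:
            n_pred[s] += 1
    ready = [task for task in dur if n_pred[task] == 0]
    running = []                                  # heap of (finish time, task)
    now, free = 0, n_proc
    while ready or running:
        ready.sort(key=priority, reverse=True)    # highest priority first (stable)
        while ready and free > 0:
            task = ready.pop(0)
            heapq.heappush(running, (now + dur[task], task))
            free -= 1
        now, done = heapq.heappop(running)        # advance to the next completion
        finished = [done]
        while running and running[0][0] == now:   # collect simultaneous completions
            finished.append(heapq.heappop(running)[1])
        free += len(finished)
        for task in finished:
            for s in succ.get(task, []):
                n_pred[s] -= 1
                if n_pred[s] == 0:
                    ready.append(s)
    busy = sum(dur.values())
    return now, 1 - busy / (n_proc * now)

# Hypothetical job: three independent tasks plus the chain D -> E -> F.
dur = {"A": 1, "B": 1, "C": 1, "D": 1, "E": 1, "F": 1}
succ = {"D": ["E"], "E": ["F"]}
level = levels(succ, dur)
arrival = {task: i for i, task in enumerate(dur)}
hlf = list_schedule(succ, dur, 2, priority=lambda t: level[t])
fifo = list_schedule(succ, dur, 2, priority=lambda t: -arrival[t])
print("HLF :", hlf)
print("FIFO:", fifo)
```

On this toy graph the path-length priority starts the chain immediately and keeps both processors busy, while the arrival-order priority delays the chain and leaves a processor idle at the end.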


At this point it becomes necessary (because of
limitations in computer time) to fix the level of
load on the system. This may be accomplished by 
selecting a specific processor range since the 
same job stream is used for simulating each con- 
figuration. We may observe that for lightly 
loaded systems (characterized by large idle times) 
all scheduling algorithms will tend to perform 
equally well because there is no real contention 
requiring a careful scheduling decision. For 
moderately-to-heavily loaded systems (character- 
ized by smaller idle times) the distinguishing 
performance characteristics of the algorithms will 
emerge. We select the 4-6 processor range since
it seems to typify the moderately-loaded situa-
tion.

Figure 2 - Comparison of the Instruction Mixes Under Load

Figure 3 - A Comparison of Several Scheduling Algorithms

One goal in this research was to find a
local strategy that would perform near to the
global strategies or to determine which of the
global strategies would perform best. It was
hoped that when memory becomes a critical re-
source the LMF algorithm might out-perform
some of the others. For the remaining simulation
runs, a limit is placed on the total memory 
occupancy at a given time in the system, causing 
the scheduler to pass over some tasks with large
memory requirements for smaller ones if they do 
not fit in the available memory. (Contiguity of 
memory is ignored here.) It is expected that as 
memory becomes restricted, all of the non memory- 
oriented strategies will degrade rapidly in their 
performance to the point where a simple memory 
strategy like LMF may succeed. 


For this reason an investigation was made 
into the memory capacity limit as reported in 
Figure 4. The LMF algorithm is shown for a few 
processor values over a wide selection of memory 
limits. Dotted lines in the figure indicate the 
performance in the case where an infinite memory 
is assumed (i.e. when the memory parameter is 
ignored by the loader). The lowest memory limit 
considered was 410 because the largest simulated 
task will just fit in that space. The result of 
this investigation is that memory limits of 410, 
600 and 800 should serve as an appropriate test 
range for systems within the established ideal 
load range of 4-6 processors. 


In the final simulation series, the global 
strategies were run under varying memory capaci- 
ties in order to see how they measure against the 
LMF algorithm as memory becomes a critical 
resource. The results for HLF, HWLF, LPF and 
LWPF were nearly identical, so HLF was chosen as 
being representative for the comparisons with 
RAND, FIFO and LMF. The effect of the memory 
constraint on each algorithm's performance is 
shown in Figure 5 for each of the memory limits 
under consideration. FIFO appears to be consis- 
tently better than RAND, although not much 
difference is noted. The performance of LMF is 
quite disappointing. The path length or
global algorithms are again clearly better than 
the others, although the margin of improvement 
dwindles rapidly when memory becomes critical. 

IV. Conclusions 

A number of conclusions may be drawn from 
the experiments performed here, some positive, 
some negative. A significant step has been made 
in the modeling of multiprocessor systems. A 
model has been successfully constructed which is 
capable of modeling the complete system from many 
different points of view and at many levels of 
detail. We have also found that the simpler 
acyclic program model is reasonable in many in- 
stances, that demand scheduling is highly 
desirable if the scheduling processor can be 
spared to perform the task often enough, and that 
more work is needed in specifying efficient 
implementation details of the scheduler's duties. 
In the way of scheduling algorithms, we have
found that the local heuristics are of no help at
all, that all of the global strategies listed are
of equal value (therefore pick the cheapest one to
implement), and that the usefulness of the global
strategies is limited when the time spent on
determining these priorities is a critical concern.
We were also disappointed by the failure of the
LMF attempt at introducing the memory parameter to
scheduling in an inexpensive manner.

Table 1 - Typical Instruction Mixes (b)

                                   FORTRAN      FORTRAN      XPL/S        SIMTRAN
                                   Floating     IO           Compiler     Simulation
Category of Instruction            Point        Statements

Memory Addressed Instructions      .8870        .7685        [illegible]  .8828
Immediate Addressed Instructions   .1130        .2019        .2228        [illegible]
Total Memory References            1.6696       1.3932       1.4844       1.4354
Instruction Fetches                1.0000       1.0000       1.0000       1.0000
Operand Addresses                  .7244        [illegible]  .5342        .4930
Operand in Memory                  .6360        .3462        .4636        .4085
Register Operands                  .0884        .0251        .0706        .0845
Operand Fetches                    .5916        .2201        .3980        .3792
Operand Stores                     .1328        [illegible]  .1363        .1138
Indirect Memory References         .0336        .0470        .0208        .0269
Total "P.c" Instructions           .5367        .6727        .7067        .7732
Branch Instructions                .1626        .3972        .2430        .3898
(Successful Branches)              .1501        .2912        .1696        .3281
Load/Store Instructions            .3675        .2726        .4637        .3834
"Execute" Instructions             .0066        .0029        .0000        .0000
Total "D.e" Instructions           .4699        .3939        .3147        .2420
Int. Add, Subtract, Compare        .0714        .3282        .2431        .1428
Floating Add, Subtract             .1668        .0003        .0000        .0197
Multiply Instructions              .1339        .0079        .0001        .0133
Divide Instructions                .0371        .0024        .0003        .0006
Logical Instructions               .0134        .0149        .0313        .0115
Shift Instructions                 .0473        .0400        .0471        .0138
Monitor and IO Calls               .0000        .0002        .0003        .0403


(b) values expressed as ratios to the number of instructions counted. Over 600,000 instruc- 
tions were counted per mix. 




PERFORMANCE EVALUATION OF A PARALLEL SYSTEM 
PROCESSING FAULT-TOLERANT PROGRAMS


K. H. Kim and M. J. Jenson 
Department of Electrical Engineering 
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Los Angeles, California 90007 


Abstract. A parallel (multiprocessor)
system processing fault-tolerant programs was
developed in [4,5]. The system performance
is evaluated in this paper, using an analytic
approach based on stochastic models. The
analysis confirms the high effectiveness of a
parallel system, under all practical circum-
stances, in reducing the program execution
time increase due to run-time validation and
system state saving. It also shows how the
system performance is affected by various
program characteristics.


1. Introduction 


A system architecture for parallel exe-
cution of fault-tolerant programs (i.e., pro-
grams containing redundancy for the tolerance
of residual program errors and/or hardware
faults [7]) was developed in [4,5]. The system
was designed to execute block-structured
fault-tolerant programs developed by Horning
et al. [3]. A fault-tolerant block or recovery
block is the basic component containing re-
dundancy in these programs and has the fol-
lowing structure: ensure T by $O_1$ else-by $O_2$
else-by ... else-by $O_n$ else-error, where T
denotes the validation test, $O_1$ the primary
object block, and $O_k$ ($1 < k \le n$) the alternate
object blocks. All of the object blocks in a 
fault-tolerant block F compute the same or 
approximately the same objective function. 
The validation test T is executed on exit from 
an object block to confirm that the object 
block has performed acceptably. The exe- 
cution of a validation test results in either 
an acceptance (i.e., confirmation) or a re- 
jection. If accepted, control exits from the 
fault-tolerant block. If the result produced by 
an object block is rejected, the next alternate 
is entered. After the alternate object block 
finishes its computation, the validation test is 
repeated. Before an alternate object block is 
entered, the system state is restored to the 
state that existed just before entry to the primary
object block [1,2,3]. To enable this,
a state vector that contains the values of all 
the variables (that may be changed by the 
object blocks) is saved on entry to a fault- 
tolerant block. 
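
The control flow of a fault-tolerant block described above can be sketched in a few lines. This is a minimal illustration, not the notation of [3] or the hardware mechanism of [4,5]: the function name `run_recovery_block`, the exception class, and the sorting example are hypothetical, and the state vector is assumed to be a dictionary that is copied in full on entry.

```python
import copy

class RecoveryBlockError(Exception):
    """Raised when every object block has been rejected (the else-error case)."""

def run_recovery_block(state, validation_test, object_blocks):
    """ensure T by O1 else-by O2 ... else-error, on a dict-valued state."""
    saved = copy.deepcopy(state)              # state vector saved on entry
    for block in object_blocks:               # primary first, then the alternates
        block(state)
        if validation_test(state):
            return state                      # accepted: exit the fault-tolerant block
        state.clear()                         # rejected: restore the entry state
        state.update(copy.deepcopy(saved))
    raise RecoveryBlockError("all object blocks rejected")

def buggy_sort(s):
    s["data"] = sorted(s["data"])[:-1]        # faulty primary: loses an element

def safe_sort(s):
    s["data"] = sorted(s["data"])             # alternate computing the same function

def accept(s):
    return len(s["data"]) == 3 and s["data"] == sorted(s["data"])

state = {"data": [3, 1, 2]}
run_recovery_block(state, accept, [buggy_sort, safe_sort])
print(state)
```

Copying the whole state on entry is the simplest possible saving scheme; the recovery cache and the duplex memory discussed in the next section exist precisely to avoid that cost.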


The goal of the parallel execution is to 
overlap, as much as possible, execution of 
object blocks with the validation and system 
state saving. In this paper, we evaluate the 
performance of the parallel system. The 
approach used in this paper for performance 
evaluation is of an analytic nature and is 
based on stochastic models for both the parallel 
system and the sequential system (i.e., one 
in which the execution of an object block is 
not overlapped with the execution of a validation 
test). The evaluation shows the performance 
gain by parallel execution over sequential 
execution. 


In the next section major characteristics 
of both an efficient sequential system and a 
parallel system are compared. Section 3.1 
deals with the evaluation of the sequential 
system. Performance of the parallel system 
is evaluated in Section 3.2 and compared with 
the performance of the sequential system in 
Section 3.3. 


2. Distinguishing Characteristics 
of a Sequential System and a Parallel System 


In this section two systems, a sequential
system using a memory organization called a
recovery cache [1,3] and a parallel system
using a duplex memory [4,5], are briefly
sketched.


The essence of the recovery cache
scheme is to save the "original value" of each
non-local variable W together with its logical
address right before the variable is modified
for the first time in a new object block. The
original values are thus saved in a compact
table structure. For illustration, the fault-
tolerant program in Figure 1a is used.


Figure 1b shows a snapshot of the re-
covery cache taken when primary object block
$O_{2,1}$ is in execution. As shown, there is a
stack, called the cache stack, used for saving
the original values. Similar to the main
stack, the cache stack is also divided into
regions, one region for each nested fault-
tolerant block in the "active" state (i.e., a
fault-tolerant block that has been entered but
not exited). The top region of the cache stack
in Figure 1b contains previous values of non-
local variables together with their names (re-
presenting logical addresses), i.e., Y2, X1,
X2, which have been modified during execution
of the current object block $O_{2,1}$. Similarly,
the bottom region of the cache stack contains
the previous value of non-local variable X1
which had been modified by execution of object
block $O_{1,1}$ before $O_{2,1}$ was entered. Figure
1b also shows a flag field in the main stack.
The flag attached to a variable indicates
whether the original value of the variable has
already been saved since the current object
block was entered. Thus the flags attached to
Y2, X1, X2 in the main stack are currently set.


If the result produced by execution of
$O_{2,1}$ fails the validation test $V_2$, then the top
region $C_2$ of the cache stack can be used to
reset the main stack to the state that existed
on entry to fault-tolerant block $F_2$. If it
passes the test, execution of $F_2$ is complete
and $C_2$ is merged into $C_1$ so that the result
will contain previous values of those variables
which are non-local to $O_{1,1}$ and have been
modified since $O_{1,1}$ was entered. Thus the
result will be a single region containing (X1,9)
and (X2,2). Flags in the main stack are also
adjusted such that only flags of X1 and X2 are
set. Therefore, the combination of the main
and cache stacks usually contains information 
with which several old state vectors can be 
reconstructed. 
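
The save-on-first-write, reset, and merge steps can be sketched as follows. This is a simplified illustration rather than the hardware organization of [1,3]: the class name `RecoveryCache` and its methods are hypothetical, every variable is treated as non-local (so the special handling of variables local to an object block, such as Y2, is left out), and the flag field is represented by membership in the top cache region.

```python
class RecoveryCache:
    """Main stack plus a cache stack with one region per active fault-tolerant block."""

    def __init__(self, variables):
        self.main = dict(variables)      # main stack: name -> current value
        self.regions = []                # cache stack: list of {name: original value}

    def enter_block(self):
        self.regions.append({})          # new top region, all flags clear

    def write(self, name, value):
        if self.regions and name not in self.regions[-1]:
            # Flag not set: first write since entry, so save the original value.
            self.regions[-1][name] = self.main[name]
        self.main[name] = value

    def reject(self):
        """Validation failed: reset the main stack from the top region."""
        top = self.regions[-1]
        self.main.update(top)
        top.clear()                      # flags cleared for the next alternate

    def accept(self):
        """Validation passed: merge the top region into the region below it."""
        top = self.regions.pop()
        if self.regions:
            for name, original in top.items():
                self.regions[-1].setdefault(name, original)   # keep the older value

cache = RecoveryCache({"X1": 9, "X2": 2})
cache.enter_block()          # enter F1
cache.write("X1", 8)         # object block O(1,1) modifies X1
cache.enter_block()          # enter F2
cache.write("X1", 7)         # object block O(2,1) modifies X1 and X2
cache.write("X2", 8)
cache.accept()               # F2 passes its validation test
print(cache.regions)
```

With the values used here, the merged region holds the entries (X1,9) and (X2,2), the same single region described in the text; a later `reject()` would then return the main stack to its state on entry to the outer block.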


In the case of parallel execution at least 
two processors are used, a main processor 
for object block execution and a VR-(validation 
and recovery) processor or audit processor 
for execution related to validation and recovery. 
It is necessary to save a state vector on exit 
from an object block since the state vector is 
used by both the main processor and the VR- 
processor. This is accomplished by simul- 
taneously storing the operand of each WRITE 
operation into two locations, one in the main 
stack and the other in the VR-store. When 
the main processor enters a fault-tolerant 
block F, a VR-store-segment is created to 
keep an execution image which consists of
records of assignments made by an object
block in F. A VR-store-segment consists of
two sections, the L-(local variable) section
for keeping records of assignments to variables
local to the object block in execution and the
N-(non-local variable) section for assignment
records of non-local variables. A variable
local to the object block being entered is 
allocated one location in the main stack and 
one location in the L-section of a VR-store- 
segment. New values assigned to variables 
that are non-local to the object block in exe- 
cution are recorded together with the logical 
addresses (of the variables) in a table struc- 
ture in the N-section of a VR-store-segment. 


For illustration, Figure 1c shows the
content of the VR-store at an instant during
execution of the program in Figure 1a by a
parallel system using a duplex memory.
When the main processor entered the program
(i.e., the outermost block), VR-store-segment
$S_0$ was created to keep assignment records
of local variables X1 and X2. Since there are
no variables non-local to the outermost block,
$S_0$ does not contain an N-section. When the
main processor entered $F_1$, VR-store-segment
$S_1$ was created. When non-local variable X1
was assigned the value "8" during execution of
object block $O_{1,1}$, a table entry (X1,8) was
made in $S_{1N}$. Similarly, $S_2$ was created when
the main processor entered $F_2$ and was filled
by execution of object block $O_{2,1}$. The content
of the main stack in a duplex memory is that
in a recovery cache minus the flag field.


On completion of $O_{2,1}$, the main pro-
cessor proceeds to the execution of $F_3$ (which
will be imaged in a new VR-store-segment $S_3$)
while the VR-processor starts examining the
execution image in $S_2$ by execution of $V_2$. If
the result produced by execution of $O_{2,1}$ (kept
in $S_2$) fails the validation test $V_2$, then the
non-local variables recorded in $S_{2N}$ (and $S_{3N}$,
if not empty) are those which need to be reset.
Segments $S_0$ and $S_1$ contain the values of the
variables that existed when the main processor
entered fault-tolerant block $F_2$ and their values
may be used to reset the main stack. A
duplex memory may be implemented such that
the previous value can be obtained in a single
content-addressable memory (CAM) cycle [4,5].
If the result of $O_{2,1}$ passes $V_2$, $S_{2L}$ is dis-
carded and $S_{2N}$ is merged into $S_1$ so that the
result contains the assignment records, of the
variables addressable in $O_{1,1}$, made since
$O_{1,1}$ was entered. This will result in $S_{1L}$
containing "1", "5" and "3" for Y1, Y2, Y3,
respectively and $S_{1N}$ containing (X1,7) and
(X2,8).
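
The duplex memory bookkeeping can be sketched in the same style. Again this is an assumption-laden illustration, not the design of [4,5]: the class `DuplexMemory`, its methods, and the `build` helper are hypothetical, only N-section records are modeled (every variable is treated as non-local, so L-sections are omitted), and the search through older segments stands in for the CAM lookup of a previous value.

```python
class DuplexMemory:
    """Main stack plus a VR-store holding one assignment-record segment per block."""

    def __init__(self, variables):
        self.main = dict(variables)
        self.segments = [dict(variables)]      # S0: the outermost block's variables

    def enter_block(self):
        self.segments.append({})               # new VR-store-segment

    def write(self, name, value):
        self.main[name] = value                # a WRITE stores into both memories
        self.segments[-1][name] = value

    def reject(self, k):
        """Image in segment k rejected: discard segments k.. and reset the main stack."""
        touched = {name for seg in self.segments[k:] for name in seg}
        del self.segments[k:]
        for name in touched:
            for seg in reversed(self.segments):    # newest surviving record wins
                if name in seg:
                    self.main[name] = seg[name]
                    break

    def accept(self, k):
        """Image in segment k accepted: merge its records into segment k-1."""
        self.segments[k - 1].update(self.segments.pop(k))

def build():
    mem = DuplexMemory({"X1": 9, "X2": 2})
    mem.enter_block(); mem.write("X1", 8)                        # F1, segment S1
    mem.enter_block(); mem.write("X1", 7); mem.write("X2", 8)    # F2, segment S2
    mem.enter_block(); mem.write("X2", 5)                        # F3 already running, S3
    return mem

rejected = build()
rejected.reject(2)        # V2 fails: undo the work recorded in S2 and S3
accepted = build()
accepted.accept(2)        # V2 passes: S2 is merged into S1
print(rejected.main, accepted.segments[1])
```

The rejected case shows why recovery is costlier here than with the recovery cache: the work of the successor block (segment S3) must be undone as well, and each previous value has to be looked up rather than read directly from a saved region.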


Let us now compare the characteristics
of the recovery cache scheme for sequential
execution with the characteristics of the duplex
memory scheme for parallel execution.


1. In both schemes, content-addressable
memory modules are needed to obtain an
acceptable level of performance in program
execution and in the rest of this paper, the
use of CAM modules is assumed.


2. The duplex memory takes more space
than the recovery cache.


3. The WRITE operation into a non-local
variable W involves two steps with the recovery
cache, the first step being used for fetching
the original value or the flag, while the WRITE
operation takes one step (CAM cycle) with the
duplex memory. Therefore, the execution of
an object block is slower with the recovery
cache than with the duplex memory.


4. Overall, it is expected that the re-
covery cache takes less merging time than the
duplex memory. During the execution of a
program in which no fault-tolerant block is
nested within another fault-tolerant block, there
is no merging involved with the recovery cache.


5. The parallel system is slower in re-
covery because (a) recovery of a variable takes
more steps with the duplex memory than with
the recovery cache and (b) there are more
variables that need to be recovered in the
parallel system because while an execution
image is being validated, the main processor
normally proceeds to the successor block(s).


In summary, the parallel system largely 
trades recovery time increase for the reduction 
of total program execution time. There are 
cases, though highly impractical, where the 
performance of the parallel system is inferior 
to the performance of the sequential system. 
Let $\alpha$ denote the reliability of an object block,
i.e., the probability of an average object block
producing an accepted execution image. Then
there is a lower bound $\alpha_l$ for $\alpha$ such that when
$\alpha > \alpha_l$, the parallel system performs more
efficiently than the sequential system. This
lower bound is one of the values of interest
examined in subsequent sections.


3. Performance Evaluation 


Given a fault-tolerant program, the aver-
age execution time of a fault-tolerant block is
defined as the execution time of the program
divided by the number of fault-tolerant blocks
executed during the program execution. $T_s$ and
$T_p$ denote the average execution time of a fault-
tolerant block by the sequential system and by
the parallel system, respectively. The system
throughput is defined as the number of fault-
tolerant blocks completed per unit time and is
given by the inverse of the average execution
time of a fault-tolerant block. We denote the
sequential system throughput and the parallel
system throughput by $THR_s$ and $THR_p$, res-
pectively. Throughputs are used in this section
as measures of the performance of the se-
quential system and of the parallel system.


For mathematical tractability, the following
set of global assumptions has been adopted
throughout the performance evaluation.


Assumption G 


G.1 The programs considered in this analysis
are of the type in which no fault-tolerant block
is nested within another fault-tolerant block and
whose execution becomes a sequential chain of
fault-tolerant block executions (Figure 2).


G.2 Primary and alternate object blocks take 
the same average execution time. 


G.3 Each fault-tolerant block contains an un- 
limited number of alternate object blocks (to 
eliminate the case of program failure). 


In executing a program satisfying assump-
tion G.1, the sequential system does not involve
assignment record merging, as mentioned in
Section 2. Assumption G.1 is adopted
because of the difficulties in (1) dealing with a
large spectrum of legitimate program structures,
(2) keeping accounts of various execution times
during execution of a general program (i.e.,
a program in which fault-tolerant blocks are
nested one within another), etc. However, it
is conjectured that results in this paper of
performance comparison between two systems
for programs satisfying G.1 will not be far
different from the results for general programs.


3.1 Throughput Evaluation for the Sequential
System 


The behavior of the sequential system
during execution of a fault-tolerant block is
depicted in Figure 3a. The system first enters
the "object block execution" state $s_o$ in which
the processor executes an object block within
the current fault-tolerant block. On completion
of an object block, the system enters the
"validation" state $s_v$ in which the processor
executes the validation test. If the validation
results in a rejection, the system enters the
"recovery" state $s_r$, and on completion of the
recovery, the system again enters $s_o$ in which
the processor executes an alternate object block.
If the validation results in an acceptance, the
system proceeds to the execution of the succes-
sor fault-tolerant block and repeats the above
behavior.


During execution of fault-tolerant programs 
satisfying assumption G, the sequential system 
continuously repeats the process depicted in 
Figure 3a. We thus model the system behavior 
by the following stochastic process for the pur-
pose of evaluating $THR_s$.


Model S


S.1 There are three states which the sequen-
tial system may enter: $s_o$ - object block exe-
cution, $s_v$ - validation, and $s_r$ - recovery.
(Due to assumption G.1 there is no merging
state.)


S.2 The time during which the system is in
any state is exponentially distributed.


S.2.1 When the system is in state $s_o$, the
rate gs of generating an execution image (i.e.,
the probability of the system completing the
execution of an object block within an infinite-
simal time interval $\Delta t$ is $gs \cdot \Delta t$), is $gs = 1/t_{os}$,
where $t_{os}$ denotes the mean object block exe-
cution time in the sequential system. gs is
called the generation rate.


S.2.2 When the system is in state $s_v$, the
rate v of completing the validation, called the
validation rate, is $v = 1/t_v$, where $t_v$ denotes the
mean validation time.


S.2.3 When the system is in state $s_r$, the rate
rs of completing the recovery, called the re-
covery rate, is $rs = 1/t_{rs}$, where $t_{rs}$ denotes
the mean recovery time in the sequential system.


S.3 The probability of the system entering
state $s_o$ after leaving state $s_v$ is $\alpha$, while the
probability of entering state $s_r$ is $\alpha' = 1 - \alpha$.
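Figure 3b depicts Model S. Let $p_o$, $p_v$,
$p_r$ denote the equilibrium probabilities [6] of
the system being in $s_o$, $s_v$, $s_r$, respectively.
The steady-state behavior of the system is
expressed by the following equilibrium equations.

$$
\begin{aligned}
p_o \cdot gs &= p_v \cdot v \cdot \alpha + p_r \cdot rs \\
p_v \cdot v &= p_o \cdot gs \\
p_o + p_v + p_r &= 1 \quad \text{(normalizing equation)}
\end{aligned}
\tag{1}
$$

Solving Eq. 1, we obtain

$$
\begin{aligned}
p_o &= rs \cdot v / (gs \cdot v \cdot \alpha' + rs \cdot v + gs \cdot rs) \\
p_v &= rs \cdot gs / (gs \cdot v \cdot \alpha' + rs \cdot v + gs \cdot rs) \\
p_r &= gs \cdot v \cdot \alpha' / (gs \cdot v \cdot \alpha' + rs \cdot v + gs \cdot rs)
\end{aligned}
\tag{2}
$$

By definition system throughput is equal to the
number of execution images accepted per unit
time. Throughput $THR_s$ and its inverse $T_s$ can
thus be obtained as follows.

$$
THR_s = p_v \cdot v \cdot \alpha
      = rs \cdot gs \cdot v \cdot \alpha / (gs \cdot v \cdot \alpha' + rs \cdot v + gs \cdot rs)
\tag{3a}
$$

$$
\begin{aligned}
T_s &= 1/THR_s \\
    &= (gs \cdot v \cdot \alpha' + rs \cdot v + gs \cdot rs) / (rs \cdot gs \cdot v \cdot \alpha) \\
    &= (1/\alpha) \cdot (t_{os} + t_v) + (\alpha'/\alpha) \cdot t_{rs}
\end{aligned}
\tag{3b}
$$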
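
The closed form for the sequential system is easy to check numerically. The sketch below evaluates the equilibrium probabilities of the three-state model (object block execution, validation, recovery) and compares the resulting throughput with the mean-time expression for the average block execution time; the function names and the sample parameter values are hypothetical and chosen only for illustration.

```python
def sequential_throughput(alpha, t_os, t_v, t_rs):
    """Equilibrium probabilities and throughput of the three-state sequential model."""
    gs, v, rs = 1.0 / t_os, 1.0 / t_v, 1.0 / t_rs     # generation, validation, recovery rates
    denom = gs * v * (1.0 - alpha) + rs * v + gs * rs
    p_o = rs * v / denom                              # executing an object block
    p_v = rs * gs / denom                             # validating
    p_r = gs * v * (1.0 - alpha) / denom              # recovering
    return p_v * v * alpha, (p_o, p_v, p_r)           # acceptances per unit time

def mean_block_time(alpha, t_os, t_v, t_rs):
    """Average execution time of a fault-tolerant block, in mean-time form."""
    return (t_os + t_v) / alpha + (1.0 - alpha) / alpha * t_rs

# Hypothetical parameters: times are in units of the mean object block time.
thr, probs = sequential_throughput(alpha=0.95, t_os=1.0, t_v=0.5, t_rs=0.2)
print(thr, 1.0 / mean_block_time(0.95, 1.0, 0.5, 0.2))
```

The mean-time form also has a direct reading: on average 1/alpha object block executions and validations are needed per accepted block, and (1 - alpha)/alpha of them are followed by a recovery.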


3.2 Throughput Evaluation for the Parallel
System 


In most cases the main processor need 
not be synchronized with the VR-processor. 
However, when the next fault-tolerant block to 
be executed specifies irreversible actions of 
critical nature, the main processor waits until 
the VR-processor accepts all the execution 
images in the queue (i.e., the execution images 
of the predecessor fault-tolerant blocks) [4,5]. 
An execution image generated immediately be-
fore a block specifying an irreversible action
is entered, is a "synchronizing" execution
image (or for short, S-image). The other
execution images are "normal" execution images
(or N-images).

An abstract representation of the parallel 
system with unbounded queue is shown in 
Figure 4. The main processor continuously 
constructs execution images and puts the com- 
pleted execution images into the queue of exe- 
cution images except when (1) the VR-processor 
stops it on rejection of an execution image and 
enters the recovery state, or (2) the main pro- 
cessor has generated a synchronizing execution 
image and put it into the queue. The VR- 
processor validates execution images in the 
order of their arrival. When it accepts an 
execution image, it enters the "merging" state.
On completion of merging, it checks if another 
execution image is waiting in the queue. If an 
execution image is rejected, the main processor 
is stopped and recovery is initiated. Recovery 
involves a sequence of assignment reversals 
using the assignment records in the execution 
images and thus can be thought of as a process 
of "erasing" the execution images in the queue. 
On completion of the recovery, the queue is 
empty and the main processor is restarted. 
The parallel system is thus modeled by the 
following stochastic process. 


Model P


P.1 The state of the system at any instant is
characterized by (1) the state of the VR-pro-
cessor which may be in wait, validation,
merging or recovery, and (2) the number and
types of execution images in the queue. The
state of the main processor is busy or waiting
and is determined by the state of the VR-pro-
cessor and the state of the queue. Thus each
system state is denoted by

$s_{\text{VR-processor state},\ \text{queue state}}$

where (1) VR-processor state = w (wait), v
(validation), m (merging), or r (recovery), and
(2) queue state = $\emptyset$ (empty), N (one normal
execution image), S (one synchronizing execution
image), \$ (= N or S), NN, NS, \$N, \$S, NNN,
NNS, \$NN, \$NS, ... .


Some possible states of the system are
shown in Figure 5, where some possible state
transitions are also indicated. For example,
$s_{v,N}$ is the state where the queue contains one
normal execution image which the VR-processor
is validating. There are four states which the
system may enter from $s_{v,N}$: $s_{v,NN}$ which is
entered if the main processor generates another
normal execution image; $s_{v,NS}$ which is entered
if the main processor generates a synchronizing
execution image; $s_{m,\$}$ which is entered if the
VR-processor accepts the normal execution
image in the queue; and $s_{r,N}$ which is entered
if the VR-processor rejects the normal exe-
cution image in the queue. In $s_{r,N}$ the system
erases the normal execution image in the queue
and thereafter enters state $s_{r,\emptyset}$ in which the
system erases the partially constructed exe-
cution image contained within the main proces-
sor. Note that the type of the first image in
the queue is not distinguished in some states
(e.g., $s_{m,\$N}$). This is because once an
execution image is accepted, the system's
future behavior is independent of the type of
the execution image just accepted.


P.2 The time during which either processor
is in a particular state is exponentially dis-
tributed.


P.2.1 When the main processor is in a busy
state, the generation rate gp is $gp = 1/t_{op}$,
where $t_{op}$ represents the mean object block
execution time (which is different from $t_{os}$).


P.2.2 When the VR-processor is in a valida-
tion state, the validation rate v is $v = 1/t_v$,
where $t_v$ represents the mean validation time.


P.2.3 When the VR-processor is in a merging
state, the rate mp of completing the merging,
called the merging rate, is $mp = 1/t_{mp}$, where
$t_{mp}$ represents the mean merging time.


P.2.4 When the system is in a recovery state
other than $s_{r,\emptyset}$, the rate rp of erasing an
execution image, called the recovery rate, is
$rp = 1/t_{rp}$, where $t_{rp}$ represents the mean time
for erasing an execution image.


P.2.5 The size of the partially constructed
execution image remaining within the main
processor when the system enters a recovery
state is assumed to be proportional to the
amount of time that the main processor has
spent in construction of that execution image.
Borrowing a result in the renewal theory, the
mean size of the execution image partially con-
structed (when the system enters a recovery
state from a state where the main processor is
busy), is the same as the mean size of a com-
pleted execution image [6]. Thus when the
system is in $s_{r,\emptyset}$, the rate of moving from
$s_{r,\emptyset}$ to $s_{w,\emptyset}$ is also rp.


P.3 The probability of a validation resulting
in an acceptance is $\alpha$ as before, while the
probability of a rejection is $\alpha' = 1 - \alpha$.


P.4 The probability of a newly generated
execution image being an N-image is $\eta$, while
that for being an S-image is $\eta' = 1 - \eta$.


Figure 5 depicts Model P. It also shows
the notation for the equilibrium probability of
the system being in each state $s_{i,j}$. The pro-
babilities are denoted by I (for $s_{w,\emptyset}$), J (for
$s_{m,\$}$), $z_k$, $y_k$, $x_k$, $w_k$, $u_0$ (for $s_{r,\emptyset}$), $u_k$, and $q_k$,
where k=1,2,... except that there does not
exist $y_1$ nor $x_1$. The subscript k indicates the
number of execution images present in the queue.
The steady-state behavior of the system is then
expressed by the following equilibrium equations.
The steady-state behavior of the system is then 
expressed by the following equilibrium equations. 


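(a) $I \cdot gp = J \cdot mp + u_0 \cdot rp + q_1 \cdot rp$
(b) $J \cdot (gp + mp) = (z_1 + w_1) \cdot v \cdot \alpha$
(c) $z_1 \cdot (gp + v) = I \cdot gp \cdot \eta + y_2 \cdot mp$
(d) $z_k \cdot (gp + v) = z_{k-1} \cdot gp \cdot \eta + y_{k+1} \cdot mp$ for k=2,3,...
(e) $y_2 \cdot (gp + mp) = J \cdot gp \cdot \eta + z_2 \cdot v \cdot \alpha$
(f) $y_k \cdot (gp + mp) = y_{k-1} \cdot gp \cdot \eta + z_k \cdot v \cdot \alpha$ for k=3,4,...
(g) $x_2 \cdot mp = J \cdot gp \cdot \eta' + w_2 \cdot v \cdot \alpha$
(h) $x_k \cdot mp = y_{k-1} \cdot gp \cdot \eta' + w_k \cdot v \cdot \alpha$ for k=3,4,...
(i) $w_1 \cdot v = I \cdot gp \cdot \eta' + x_2 \cdot mp$
(j) $w_k \cdot v = z_{k-1} \cdot gp \cdot \eta' + x_{k+1} \cdot mp$ for k=2,3,...
(k) $u_0 \cdot rp = u_1 \cdot rp$
(l) $u_k \cdot rp = z_k \cdot v \cdot \alpha' + u_{k+1} \cdot rp$ for k=1,2,...
(m) $q_k \cdot rp = w_k \cdot v \cdot \alpha' + q_{k+1} \cdot rp$ for k=1,2,...
(n) $I + J + u_0 + \sum_{k=1}^{\infty} (z_k + y_k + x_k + w_k + u_k + q_k) = 1$ (normalizing equation)
                                                  (4)

Solving this system of equations, we can obtain
the equilibrium probabilities. This system can
be solved in closed form, but the solution pro-
cedure is not described here. Since the sys-
tem throughput $THR_p$ was defined as the num-
ber of acceptances made per unit time, $THR_p$
and $T_p$ can be obtained by

$$
\begin{aligned}
THR_p &= v \cdot \alpha \cdot \Big( \sum_{k=1}^{\infty} z_k + \sum_{k=1}^{\infty} w_k \Big) \\
T_p &= 1 \Big/ \Big( v \cdot \alpha \cdot \Big( \sum_{k=1}^{\infty} z_k + \sum_{k=1}^{\infty} w_k \Big) \Big)
\end{aligned}
\tag{5}
$$

Another measure of interest is the expected
queue-length E(QL).

$$
E(QL) = J + \sum_{k=1}^{\infty} k \cdot (z_k + y_k + x_k + w_k + u_k + q_k),
\quad \text{where } y_1 = x_1 = 0.
\tag{6}
$$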
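
Since the closed-form solution is not given, a numerical check is a convenient substitute. The sketch below builds the Markov chain of Model P with the queue truncated at K images and solves its balance equations by Gauss-Jordan elimination. It rests on several assumptions that should be read as such: the transition structure is one reading of Model P (states I, J, z, y, x, w, u, q as named in the text, with w, x and q taken to be the states in which the main processor waits behind a synchronizing image), the truncation simply forbids arrivals beyond K images, and the function name `solve_model_p` and all parameter values are hypothetical.

```python
def solve_model_p(alpha, eta, gp, v, mp, rp, K=20):
    """Return (throughput, expected queue length) for the chain truncated at K >= 2."""
    rates = {}                                     # (source, target) -> transition rate

    def add(src, dst, rate):
        rates[(src, dst)] = rates.get((src, dst), 0.0) + rate

    add("I", ("z", 1), gp * eta)                   # first image is an N-image
    add("I", ("w", 1), gp * (1 - eta))             # first image is an S-image
    add("J", "I", mp)                              # merging done, queue empty
    add("J", ("y", 2), gp * eta)
    add("J", ("x", 2), gp * (1 - eta))
    add(("u", 0), "I", rp)                         # erase the partial image
    for k in range(1, K + 1):
        add(("z", k), "J" if k == 1 else ("y", k), v * alpha)     # accept
        add(("z", k), ("u", k), v * (1 - alpha))                  # reject, main busy
        add(("w", k), "J" if k == 1 else ("x", k), v * alpha)     # accept
        add(("w", k), ("q", k), v * (1 - alpha))                  # reject, main waiting
        add(("u", k), ("u", k - 1), rp)
        add(("q", k), ("q", k - 1) if k > 1 else "I", rp)
        if k >= 2:
            add(("y", k), ("z", k - 1), mp)
            add(("x", k), ("w", k - 1), mp)
        if k < K:                                  # truncation: no arrivals beyond K
            add(("z", k), ("z", k + 1), gp * eta)
            add(("z", k), ("w", k + 1), gp * (1 - eta))
            if k >= 2:
                add(("y", k), ("y", k + 1), gp * eta)
                add(("y", k), ("x", k + 1), gp * (1 - eta))

    states = sorted({s for pair in rates for s in pair}, key=str)
    pos = {s: i for i, s in enumerate(states)}
    n = len(states)
    A = [[0.0] * (n + 1) for _ in range(n)]        # balance equations, augmented
    for (src, dst), rate in rates.items():
        A[pos[dst]][pos[src]] += rate              # flow into dst
        A[pos[src]][pos[src]] -= rate              # flow out of src
    A[0] = [1.0] * (n + 1)                         # one equation replaced by sum(pi) = 1
    for col in range(n):                           # Gauss-Jordan with partial pivoting
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(n):
            if r != col and A[r][col] != 0.0:
                f = A[r][col] / A[col][col]
                for c in range(col, n + 1):
                    A[r][c] -= f * A[col][c]
    pi = {s: A[pos[s]][n] / A[pos[s]][pos[s]] for s in states}

    def p(kind, k):
        return pi.get((kind, k), 0.0)

    thr = v * alpha * sum(p("z", k) + p("w", k) for k in range(1, K + 1))
    eql = pi["J"] + sum(k * sum(p(kind, k) for kind in "zyxwuq") for k in range(1, K + 1))
    return thr, eql

thr_p, eql_p = solve_model_p(alpha=0.95, eta=0.9, gp=1.0, v=2.0, mp=5.0, rp=4.0)
print(thr_p, eql_p)
```

Increasing K until the two returned values stop changing gives a rough check that the truncation is harmless for the chosen parameters.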


Figure 6 depicts the expected queue-
length E(QL) for various values of $\alpha$, $\eta$, $t_v/t_{op}$,
$t_{mp}/t_{op}$, $t_{rp}/t_{mp}$. In examining Figures 6 and
7, we are mostly interested in the cases where
$\alpha$ is greater than 0.9. Since fault-tolerant
programs dealt with here are supposed to have
undergone a testing phase before being put into
operation, one or more erroneous object blocks
out of ten seems highly improbable. On the
other hand, $\eta$ is application-dependent and may
not be very close to 1. For example, $\eta$=0.999
implies that only one among 1000 execution
images generated is an S-image. In this eva-
luation, $\eta$ is set mostly within the range of
0.9-0.95 and the most frequently used values
are 0.9 for $\eta$ and 0.95 for $\alpha$. The following
practical constraints were also adopted.

$$
t_v \le t_{op}, \qquad
t_{mp} < t_{op}, \qquad
1 \le t_{rp}/t_{mp} \le 1.5
\tag{7}
$$
As expected, E(QL) becomes larger as $\alpha$
or $\eta$ increases. Furthermore, comparison of
curve 3 in Figure 6a (which is a result of
changing $\alpha$ when $\eta$=0.95) with curve 2' (a result
of changing $\eta$ when $\alpha$=0.95) indicates that
E(QL) is more sensitive to the change of $\eta$ than
to the change of $\alpha$. This is also shown by a
comparison of curve 2 (a result of changing $\alpha$
when $\eta$=0.9) with curve 1 (a result of changing
$\eta$ when $\alpha$=0.9). Figure 6b shows that E(QL)
increases as mean validation time $t_v$ or mean
merging time $t_{mp}$ increases. When $t_v + t_{mp} < t_{op}$,
E(QL) is generally smaller than 5. The
results obtained but not plotted in Figure 6 in-
dicated that mean recovery time $t_{rp}$ affects
E(QL) to a negligible extent. This is because
(1) when $\alpha$ is large, the system rarely enters
a recovery state and (2) when $\alpha$ is small, the
system rarely enters a state where the queue-
length is large.


3.3 Performance Comparison Between the 
Sequential System and the Parallel System 


A simple way of assessing the performance
of the parallel system is to compare the
throughput THR_p with the throughput THR_s of
the sequential system. THR_p/THR_s is then
the throughput ratio and is a function of α, η,
t_v/t_op, t_mp/t_op, t_rp/t_mp, t_os/t_op, and t_rp/t_rs.
Here t_os/t_op represents the object block
execution time ratio while t_rp/t_rs represents the
recovery time ratio. These parameters are
within the following ranges (cf. Section 2 or
[5] for more details),


rp rs oe 
Figure 7 depicts the throughput ratio for
various values of parameters subject to the
constraints in Eqs. (7) and (8). First, Figure
7a discloses that variation of recovery time
ratio t_rp/t_rs within a practical range has
little effect on the throughput ratio. This is
again because (1) when α is large, the system
rarely gets into a recovery state, and (2) when
α is small, E(QL) becomes small and thus a
recovery involves mostly a small number of
execution images. Figure 7b indicates that
the throughput ratio is not much affected by the
change of t_rp/t_mp for α within a practical
range, while it is significantly affected by
object block execution time ratio t_os/t_op.
Object block execution time ratio t_os/t_op,
recovery time ratio t_rp/t_rs and t_rp/t_mp are
machine characteristics while other parameters
represent program characteristics.


Figure 7c shows that the throughput ratio
decreases as merging time t_mp (more precisely
t_mp/t_op) increases. The obvious reason is
that under assumption G.1 merging is
involved only in parallel execution. It also shows
that an increase of t_v causes a throughput ratio
increase approximately until t_v+t_mp surpasses
t_op, but further increase of t_v does not change
(actually slightly decreases) the throughput ratio.
This can be explained as follows. As t_v+t_mp
becomes larger than t_op, E(QL) becomes large
and thus, each time a synchronizing execution
image is generated, the queue contains a large
number of execution images. The validation
and merging of these are not overlapped with
object block execution. Figure 7d confirms
the expectation that as η increases, the
throughput ratio increases.


In summary, (1) for a practical α, the
performance improvement by parallel execution
is most sensitive to object block execution time
ratio t_os/t_op and t_mp/t_op, less sensitive to
t_v/t_op and the least sensitive to t_rp/t_mp and
recovery time ratio t_rp/t_rs, and (2) the
throughput ratio ranged over 1.02-1.65 (or
2-65% gain) for α=0.95 and for the values of
other parameters plotted in Figure 7.


Figure 7a also displays the existence of
the lower bound of α defined in Section 2 (the
value that α must exceed to make the
performance of the parallel system superior to
that of the sequential system). The data
obtained but not fully plotted in Figure 7 showed
that in all the cases depicted in Figure 7, this
lower bound did not exceed 0.87 and rarely went
above 0.6. It can conservatively be said that
the practical range of α is far above this bound.


4. Summary


The analysis made in this paper confirmed
that parallel execution can reduce the execution
time increase inherent in fault-tolerant
programs. The analysis demonstrated two main
points. First, under all practical circumstances
the parallel system showed good performance.
The performance was particularly good when α
was above 0.9 or 0.95. It is believed that α
would always be in such a range for programs
which have undergone a reasonable degree of
testing before being put into operation. Second,
it showed how the effectiveness of parallel
execution was affected by various program
characteristics. Although no real statistics on
various program characteristics are available,
it is believed that our examination covered a
broad range of reasonable values for each
parameter. Availability of a parallel system
may influence the program characteristics to
some extent.


In short, the parallel execution approach
allows the incorporation of extensive validation
and recovery facilities without the associated
expensive execution time overhead. The price
paid is the increased hardware requirement.


Figure 5. Model P




Figure 6. Expected queue-length E(QL).


Analysis of Asynchronous Multiprocessor Algorithms
with Applications to Sorting


John T. Robinson 
Department of Computer Science 
Carnegie-Mellon University 
Pittsburgh, Pennsylvania 15213 


Abstract - Efficient algorithms for asynchronous
multiprocessor systems must achieve a balance between
low process communication and high adaptability to
variations in process speed. Algorithms which employ
problem decomposition can be classified as static and
dynamic. Static and dynamic algorithms are particularly
suited for low process communication and high
adaptability, respectively. In order to find the "best"
method, something about mean execution times must be
known. Techniques for the analysis of the mean
execution time are developed for each type of algorithm,
including applications of order statistics and queueing
theory. These techniques are applied in detail to (1)
static generalizations of quicksort, (2) static
generalizations of merge sort, and (3) a dynamic
generalization of quicksort.


1 - Introduction 


We consider the design and analysis of k-process 
algorithms for an asynchronous multiprocessor system, 
which consists of k or more processors sharing a common 
memory by means of a switch or connecting network. In
addition there is an operating system providing such 
functions as process creation, scheduling of processes, 
allocation of memory, synchronization, etc. A real example 
of such a system is described in [7], and a general 
discussion of asynchronous parallel algorithms is 
presented in [5]. A k-process algorithm will be presented 
by giving the procedure each process executes when 
assigned a processor. We will assume that a processor is 
always available for any of the k processes that is 
runnable. 


(a) This research was supported in part by the National
Science Foundation under grant MCS75-222-55 and the
Office of Naval Research under contract N00014-76-C-0370,
NRO44-422.


Given a task we wish to execute on such a system,
in order to exploit parallelism we must decompose the
task into a set of subtasks. Some subtasks cannot begin
until others which they depend upon finish; this
establishes a precedence relation between tasks.
Inefficiency in an algorithm arises when some process
must spend too much time waiting for other processes to
complete subtasks, and again towards the end of
execution when there are fewer than k subtasks.
Attempts to remedy this by "evenly" dividing the original
task are hopeless, since task execution time will vary due
to variations in the input, the effects of other users,
properties of the operating system, processor-memory
interference, and many other causes. Any efficient
algorithm must adapt to these variations. However, this
adaptation is expensive, in that it requires process
communication. Thus the trade-off between adaptability
and process communication must be considered in the
design of multiprocessor algorithms. In the algorithms
considered in this paper, process communication takes
place by means of global data accessible by all processes.
Since in many cases access to this global data must be
confined to a critical section, one cause of process
communication overhead is the interference between
processes seeking access to this global data.


Two methods of decomposition naturally arise: (1)
static decomposition, in which the set of subtasks and
their precedence relations are known before execution,
and (2) dynamic decomposition, in which the set of
subtasks changes during execution. Static decomposition
algorithms offer the possibility of very low process
communication, providing there are not too many tasks;
however, their adaptability is limited. Dynamic
decomposition algorithms can adapt to variations in task
execution time very well, but only at the expense of high
process communication.


Given a problem which can be decomposed into 
subproblems, which method is best? Is the extra expense 
necessary for fast process communication (thus 
supporting efficient dynamic algorithms) justified? If a 
dynamic algorithm is used, how far should decomposition 
proceed? In order to answer these questions we need 
techniques for finding mean execution times for these 
types of algorithms. 


In section 2, algorithms employing static
decomposition are considered. We develop techniques for
finding the probability distribution of total execution time 
in terms of the distributions of individual task execution 
times, and when these are not known, techniques for 
finding bounds on the mean execution time. In section 3, 
the mean execution time for a simple model of a dynamic 
algorithm is found, assuming exponentially distributed task 
execution times. In sections 4 and 5 the results of 
section 2 are applied to static generalizations of quicksort 
and merge sort. Certain partitioning strategies are shown 
to be unsuitable for a static decomposition version of 
quicksort. In addition, a parallel merging algorithm is 
presented and analyzed. In section 6 a dynamic 
generalization of quicksort is presented. Using a result of 
section 3, the mean execution time is found, and an 
expression for the optimal degree of decomposition is 
derived. Section 7 contains a summary of the main 
results. 


2 - Static Decomposition Algorithms 


Given a set of tasks T_1,T_2,...,T_n, partially ordered by
a precedence relation <, we call T_i a predecessor of T_j (T_j
a successor of T_i) if T_i<T_j. If there is no task U such that
T_i<U<T_j, T_i is said to be an immediate predecessor of T_j
(T_j an immediate successor of T_i). Tasks with no
predecessors are called initial, and tasks with no
successors are called final. In the execution of the static
algorithm, each process does the following:


(1) Select either an initial task or a task all of
whose predecessors have been completed,
which has not already been selected. Check
in the order T_1,T_2,...,T_n.

(2) If no task can be selected, go to sleep,
unless all tasks have already been selected,
in which case terminate. When awakened go
to (1).

(3) Execute the selected task.

(4) For each immediate successor of the task,
record that an immediate predecessor has
completed, and wake up a sleeping process if
possible.

(5) Repeat from (1).


For the purposes of analysis we assume that steps
(1),(2),(4), and (5) take zero time, and that the execution
time of task T_i is given by the random variable t_i, with
cumulative distribution function (c.d.f.) F_i.


Definition - The task-graph G associated with T_1,T_2,...,T_n
and < is a directed graph with nodes T_1,T_2,...,T_n and
arrows from T_i to T_j if T_i is an immediate predecessor of
T_j.

Note that there is a one-to-one correspondence 
between partially ordered sets of tasks and task-graphs. 


Definition - G is a chain if the tasks are totally ordered.


The length of a chain is the number of tasks in the
chain. If in a chain the initial task is T_i and the final task
is T_j we say it is a chain from T_i to T_j. A sub-graph of a
task-graph G which is a chain is said to be a chain in G.


Definition - The level of a task T in a task-graph G is the
maximum length of any chain in G from an initial task to T.
The depth of G is the maximum level of any task.


Definition - A set of tasks is independent if for any tasks
T_i, T_j in the set, neither T_i<T_j nor T_j<T_i. The width of a
task-graph is the maximum size of any independent
subset of tasks.


Given a task-graph G, let t_G be the random variable
representing total execution time (the time from when all
processes are started until the last process terminates).
Assume t_G has c.d.f. F_G. In the following definition a class
of task-graphs is defined for which F_G can be expressed
simply in terms of the F_i.


Definition - Let C_1,C_2,...,C_m be all chains from initial to final
tasks in G. For each chain C_i containing tasks T_{i_1},T_{i_2},...,
let E_i be the expression (x_{i_1} x_{i_2} ...), where x_1,x_2,...,x_n are
polynomial variables. Then G is said to be simple if the
polynomial E_1+E_2+...+E_m can be factored so that each
variable appears exactly once (see figure 2.1).


Figure 2.1


Theorem - If k≥width(G), then t_G can be expressed in
terms of the t_i using only + and max. Furthermore, if G is
simple and the t_i are independent, then F_G can be
expressed in terms of the F_i using only · (multiplication)
and * (convolution).


Proof: Note that since k≥width(G) each task begins
immediately after its last predecessor completes. Let
C_1,C_2,...,C_m be all chains from initial to final tasks. Then

$$ t_G = \max_{1 \le i \le m} \Big( \sum_{T_j \in C_i} t_j \Big). $$

Next note that + and max are commutative and associative
operations, and that + distributes over max (i.e.,
max(a,b)+c=max(a+c,b+c)). Thus if G is simple the
expression for t_G above can be factored in terms of max
and + so that each random variable appears only once.
Then, if the t_i are independent, the expression for F_G may
be found by substituting F_i for t_i, * for +, and · for max in
the expression for t_G (see figure 2.2).
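
As an illustration of the first part of the theorem, the following minimal Python sketch evaluates the total execution time for one fixed set of task times, assuming k is at least the width of the task-graph so that every task starts as soon as its last predecessor completes. The function name `total_time` and the dictionary representation of the task-graph are assumptions made for this example only; they do not come from the paper.

```python
from functools import lru_cache


def total_time(times, preds):
    """Total execution time using only + and max.

    times: dict mapping task name -> execution time (hypothetical input format).
    preds: dict mapping task name -> iterable of immediate predecessors.
    Assumes the graph is acyclic and k >= width(G), so no task ever waits
    for a free processor.
    """

    @lru_cache(maxsize=None)
    def finish(task):
        # A task starts when its last immediate predecessor finishes.
        start = max((finish(p) for p in preds.get(task, ())), default=0.0)
        return start + times[task]

    # The run ends when the last task finishes; this equals the maximum,
    # over all chains from initial to final tasks, of the summed task times.
    return max(finish(t) for t in times)
```

For a four-task graph in which T1 precedes T2 and T3, and both precede T4, the result is the larger of the two chain sums.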


Figure 2.2 


tg=max (max (t,, to+ts, tatty Foa (CF Fo) ag) F a) aFe : 


Thus in the proof of this theorem we have a 
method for calculating the c.d.f. of total execution time for 
simple task-graphs with independent task execution times, 
providing we know the c.d.f. of the execution time of each 
task. When the c.d.f.s of each task's execution time are
not known, the best we can do is derive bounds on mean
execution time, such as those of the following theorem. 
The expected value of a random variable x is denoted by 
E(x). 


Theorem - Given a task-graph G with k≥width(G) and with
the t_i independent, let C_1,C_2,...,C_m be all chains in G from
initial to final tasks. Also let H_i be the set of all tasks of
level i, for 1≤i≤l where l=depth(G). Then

$$ \max_{1 \le i \le m} \sum_{T_j \in C_i} E(t_j) \;\le\; E(t_G) \;\le\; \sum_{1 \le i \le l} E\Big( \max_{T_j \in H_i} t_j \Big) \qquad (2.1) $$


Proof: From above,

$$ t_G = \max_{1 \le i \le m} \Big( \sum_{T_j \in C_i} t_j \Big). $$

The lower bound then follows from E(max{x_i}) ≥ max{E(x_i)}
for any random variables x_i. For the upper bound, let
t_0=0 and define f(i,j)=0 if C_i∩H_j is empty, otherwise f(i,j) is
the index of the single task in C_i∩H_j. Then

$$ t_G = \max_{1 \le i \le m} \Big( \sum_{1 \le j \le l} t_{f(i,j)} \Big) \;\le\; \sum_{1 \le j \le l} \Big( \max_{1 \le i \le m} t_{f(i,j)} \Big) $$

from which the result follows.


The upper bound in equation 2.1 is useful only if
something can be said about E(max{t_i}). An applicable
result from order statistics (see [2]) is that if the
independent random variables x_1,x_2,...,x_m are identically
distributed with mean u and standard deviation s, then

$$ E(\max\{x_i\}) \;\le\; u + \frac{m-1}{\sqrt{2m-1}}\, s \qquad (2.2) $$

Hence the following corollary:

Corollary - If k≥width(G), the t_i are independent,
depth(G)=l, and the m_j tasks on level j have identically
distributed execution times with mean u_j and standard
deviation s_j, then

$$ \sum_{1 \le j \le l} u_j \;\le\; E(t_G) \;\le\; \sum_{1 \le j \le l} \Big( u_j + \frac{m_j-1}{\sqrt{2m_j-1}}\, s_j \Big) \qquad (2.3) $$
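
The bounds of the corollary are easy to evaluate numerically. The sketch below assumes each level is summarized by a triple (number of tasks, mean, standard deviation) and uses the order-statistics bound mean + (m-1)/sqrt(2m-1) times the standard deviation for the expected maximum of m identically distributed independent variables. The function name `corollary_bounds` and the input format are illustrative assumptions, not part of the paper.

```python
import math


def corollary_bounds(levels):
    """Lower and upper bounds on mean total execution time.

    levels: list of (m_j, u_j, s_j) triples, one per level of the
    task-graph: number of tasks, mean and standard deviation of the
    (identically distributed, independent) task times on that level.
    Assumes k >= width(G).
    """
    lower = sum(u for _, u, _ in levels)
    upper = sum(u + (m - 1) / math.sqrt(2 * m - 1) * s for m, u, s in levels)
    return lower, upper
```

A level containing a single task contributes only its mean to both bounds, since the factor (m-1)/sqrt(2m-1) is zero for m=1.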


Let w=width(G). When w>k, F_G cannot in general be
expressed simply in terms of the F_i, even when G is
simple and the t_i are independent. For example, let G
consist of T_1,T_2,T_3 with the set {T_1,T_2,T_3} independent,
and let k=2. Then t_G=max(min(t_1,t_2)+t_3, max(t_1,t_2)), and t_G
cannot be simplified further.
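
The three-task example can be checked with a small event-driven simulation of the static algorithm for a fixed set of task times. This is only a sketch under the paper's zero-overhead assumption for task selection and wake-up; the function name `simulate_static` and the list-based graph representation are assumptions introduced here.

```python
import heapq


def simulate_static(times, preds, k):
    """Simulate the static algorithm with k processes and zero overhead.

    times: list of task execution times, indexed in selection order.
    preds: list of lists; preds[t] holds the indices of the immediate
           predecessors of task t (graph assumed acyclic).
    Returns the time at which the last task completes.
    """
    n = len(times)
    started = [False] * n
    done = [False] * n
    running = []  # heap of (finish_time, task)
    now = 0.0
    idle = k
    while True:
        # Idle processes pick ready tasks, checking in index order.
        for t in range(n):
            if idle == 0:
                break
            if not started[t] and all(done[p] for p in preds[t]):
                started[t] = True
                heapq.heappush(running, (now + times[t], t))
                idle -= 1
        if not running:
            return now  # nothing running and nothing selectable: finished
        # Advance to the next task completion and free its process.
        now, t = heapq.heappop(running)
        done[t] = True
        idle += 1
```

With three independent tasks and k=2, the simulated time agrees with max(min(t_1,t_2)+t_3, max(t_1,t_2)) for the sample values tried in the accompanying checks.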


When w>k, the lower bounds for E(t_G) given above
still hold. For an upper bound we take the following
approach. It is assumed that w processes are created,
and each process has a processor available at least k/w
of the time. For example, the bound given in the
corollary becomes

$$ \sum_{1 \le j \le l} u_j \;\le\; E(t_G) \;\le\; \frac{w}{k} \sum_{1 \le j \le l} \Big( u_j + \frac{m_j-1}{\sqrt{2m_j-1}}\, s_j \Big) \qquad (2.4) $$

Finally, when the t_i are dependent, in general
special techniques must be used, such as those in the
analysis of partitioning strategies (section 4) or parallel
merging (section 5).


3 - A Dynamic Decomposition Algorithm


Given a task T and a procedure which decomposes 
a task into two tasks which may be executed 
concurrently, we consider the following dynamic 
algorithm: First, there is a decomposition phase, in which 
each process repeatedly removes tasks from the task- 
queue TQ (which initially contains only T), decomposes the 
task and inserts the two new tasks in TQ, until there is a 
total of M tasks. Next, there is an execution phase, in 
which each process repeatedly removes tasks from TQ 
and executes the task. 


We analyze this algorithm under the following 
assumptions: 


(1) In this section the time to access TQ is
assumed to be 0.

(2) The time to decompose a task is assumed
to be exponentially distributed with mean
d_i, where i is the current total number of
tasks.

(3) The time to execute a task is assumed to
be exponentially distributed with mean e_M.


We use standard queueing theory techniques in the 
analysis (see for example [3]). Adopting as a state 
variable the total number of tasks in TQ or currently 
being executed or decomposed, the state-transition-rate 
diagram is given by figure 3.1. 


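Figure 3.1 (state-transition-rate diagram; decomposition
rates 1/d_1, 2/d_2, ..., k/d_k, ..., k/d_{M-1} and execution rates
1/e_M, 2/e_M, ..., k/e_M)


The mean execution time is found to be:

$$ T = e_M \Big( \frac{M}{k} + H_k - 1 \Big) + \sum_{1 \le i \le M-1} \frac{d_i}{\min(i,k)} \qquad (3.1) $$

where H_k = (1 + 1/2 + 1/3 + ... + 1/k).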
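
Under assumptions (1)-(3), the mean execution time can be read as a sum of mean holding times: while there are i tasks in the decomposition phase, min(i,k) processes decompose at rate 1/d_i each, and in the execution phase min(j,k) processes execute at rate 1/e_M each while j tasks remain. The sketch below evaluates that sum in closed form; it is a reading of the model offered for illustration, with the function names and the callable `d` being assumptions of this example, and it assumes M is at least k.

```python
def harmonic(n):
    """H_n = 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / i for i in range(1, n + 1))


def mean_dynamic_time(M, k, e_M, d):
    """Mean time of the two-phase dynamic algorithm (assumes M >= k).

    M:   total number of tasks produced by decomposition.
    k:   number of processes.
    e_M: mean execution time of one of the M final tasks.
    d:   callable, d(i) = mean decomposition time when i tasks exist.
    """
    # Execution phase: (M - k + 1) states served at rate k/e_M,
    # then k - 1 states served at rates (k-1)/e_M, ..., 1/e_M.
    execution = e_M * (M / k + harmonic(k) - 1.0)
    # Decomposition phase: state i is left at rate min(i, k)/d(i).
    decomposition = sum(d(i) / min(i, k) for i in range(1, M))
    return execution + decomposition
```

For k=1 the expression reduces to the sequential total: M executions plus M-1 decompositions.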


4 - Static Quicksort


We consider a static generalization of quicksort as 
given by the task-graph of figure 4.1 (see [6] for a 
complete discussion of sequential quicksort): 


Figure 4.1


The tasks may be described as follows:


(1) P_{1,1} is a partition of the file to be sorted.

(2) P_{i,j} (j odd) is a partition of the left subfile
produced by P_{i-1,(j+1)/2}.

(3) P_{i,j} (j even) is a partition of the right
subfile produced by P_{i-1,j/2}.

(4) S_j (j odd) is a quicksort of the left subfile
produced by P_{L-1,(j+1)/2}.

(5) S_j (j even) is a quicksort of the right
subfile produced by P_{L-1,j/2}.


First consider the simplest case, where k is a
power of 2 and L=1+lg(k) (where lg is log_2). In this case
the width of the task graph is k. The question arises as
to what partitioning strategy to use, that is, how should
the partitioning element be selected in the P tasks? First
a definition of asymptotic mean speedup:


Definition - Given an algorithm for k processes, let the
mean total execution time be T_k(N), where N is the size of
the input. Then the asymptotic mean speedup S_k is
defined to be

$$ S_k = \lim_{N \to \infty} \frac{T_1(N)}{T_k(N)}. $$


We would prefer a partitioning strategy which gives 
asymptotic mean speedup of k even in the simplest case; 
strategies which depend on large L for speedup are 
unsuitable since the number of tasks increases 


exponentially with L, and one of the main advantages of 
static algorithms is low overhead. 


It is now necessary to make some assumptions
about the execution times of tasks. In the sequential
analysis of quicksort it is found that partitioning a file of
size N takes O(N) time with standard deviation O(N), and
that sorting a file of size N takes O(N lg(N)) time with
standard deviation O(N) (see [6]). Thus in analyzing
asymptotic mean speedup it is only necessary to consider
the sorting task times.


(1) When the partitioning element for a partition of
a file of size M is selected at random, it is natural to
assume that either subfile size is uniformly distributed
between 0 and M. This, together with the fact that the
sum of the subfile sizes is M, gives an expected maximum
subfile size of 3M/4. Using this, it is easy to show that of
the k subfiles to be sorted in the sorting tasks, the
expected maximum subfile size is at least (3/4)^{lg(k)} N,
which implies S_k ≤ k^{lg(4/3)}.


(2) If the median of three method is used to select
the partitioning element, and if it is assumed that the final
position of each of the three elements in the subfile is
uniformly distributed between 0 and M, then the
probability density function for the size of either subfile
is:

$$ f(x) = \frac{6x}{M^2} \Big( 1 - \frac{x}{M} \Big) $$

This gives an expected maximum subfile size of 11M/16.
As in (1), it can be shown that the expected maximum size
of the subfiles to be sorted is larger than (11/16)^{lg(k)} N.
It follows S_k ≤ k^{lg(16/11)}.


(3) If the partitioning elements for all partitioning
tasks are found using the method of samplesort (first pick
k-1 elements randomly, sort, and use these for the k-1 P
tasks), and if the final position of each of the k-1
elements is assumed to be uniformly distributed between
0 and N, then the probability density function for the size
of the largest subfile to be sorted is:

$$ f(x) = \frac{k(k-1)}{N} \sum_{1 \le j \le \lfloor N/x \rfloor} (-1)^{j+1} \binom{k-1}{j-1} \Big( 1 - \frac{jx}{N} \Big)^{k-2} $$

(See the discussion on the random division of an interval
in [2].) It follows the expected maximum size of the
subfiles to be sorted is:

$$ \int_0^N x f(x)\, dx = \frac{N}{k} \sum_{1 \le j \le k} \frac{1}{j} = \frac{N H_k}{k}. $$

Hence S_k = k/H_k.


(4) Finally we turn to the partitioning strategy of
first finding the median (in O(M) time, where M is the size
of the subfile) in each P task, and using the median as the
partitioning element. This does give S_k=k, but it should
be noted that median finding represents a large overhead.
Unless process communication is extremely expensive, a
dynamic generalization of quicksort (such as the one
presented in section 6) is probably better.
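
The constants quoted for strategies (1)-(3) can be checked numerically. The sketch below integrates the expected larger fraction of a split for a given density on the unit interval (the uniform density for a random partitioning element, and 6x(1-x), the density of the median of three independent uniform positions), and evaluates the resulting speedup limits. All function names are assumptions made for this illustration.

```python
import math


def harmonic(n):
    """H_n = 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / i for i in range(1, n + 1))


def expected_max_fraction(density, steps=100_000):
    """E[max(x, 1 - x)] for a split position x with the given density on
    [0, 1], computed by the midpoint rule."""
    h = 1.0 / steps
    total = 0.0
    for n in range(steps):
        x = (n + 0.5) * h
        total += max(x, 1.0 - x) * density(x)
    return total * h


def speedup_bound_random(k):
    """Upper bound k**lg(4/3) for a randomly chosen partitioning element."""
    return k ** math.log2(4.0 / 3.0)


def speedup_bound_median3(k):
    """Upper bound k**lg(16/11) for median-of-three partitioning."""
    return k ** math.log2(16.0 / 11.0)


def speedup_samplesort(k):
    """Asymptotic mean speedup k/H_k for samplesort-style partitioning."""
    return k / harmonic(k)
```

For k=4 these give roughly 1.78, 2.12 and 1.92 respectively, all well below the ideal speedup of 4 obtained with exact medians.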


If the mean and standard deviation of the time to
quicksort a file of size M are a_s M lg(M) and b_s M, and the
mean and standard deviation of the time to find the
median of a file of size M and partition the file using the
median as partitioning element are a_p M and b_p M, then
from equation 2.3 we find that the mean total execution
time is less than

$$ a_s \frac{N}{k} \lg\Big(\frac{N}{k}\Big) + \Big[ 2a_p\Big(1-\frac{1}{k}\Big) + \frac{k-1}{k\sqrt{2k-1}}\, b_s + \sum_{1 \le j \le \lg(k)-1} \frac{2^j-1}{2^j\sqrt{2^{j+1}-1}}\, b_p \Big] N $$

When L is greater than 1+lg(k) a similar result may be
found using equation 2.4.


5 - Static Merge Sort


Consider a static generalization of merge sort as 
given by figure 5.1 (see [4] for a discussion of sequential 
merge sort): 


Figure 5.1


The tasks may be described as follows, assuming the file
to be sorted consists of records 1 through N:


(1) S_i is a merge sort of all the records
between (i-1)(N/2^{L-1})+1 and i(N/2^{L-1}).

(2) M_{2,j} is a merge of the two sorted files
produced by S_{2j-1} and S_{2j}.

(3) M_{i,j} (i>2) is a merge of the two sorted
files produced by M_{i-1,2j-1} and M_{i-1,2j}.


When k is a power of 2 and L=1+lg(k), the width of
the task graph is k and equation 2.3 may be applied.
Assuming the time to merge sort a file of size N has mean
a_s N lg(N) and standard deviation b_s N, and that the time to
merge two files of sizes M and N has mean a_m(M+N) and
standard deviation b_m (see [4]), we find that the mean
total execution time is less than

$$ a_s \frac{N}{k} \lg\Big(\frac{N}{k}\Big) + 2a_m\Big(1-\frac{1}{k}\Big)N + \frac{(k-1)\, b_s N}{k\sqrt{2k-1}} + \sum_{1 \le j \le \lg(k)-1} \frac{2^j-1}{\sqrt{2^{j+1}-1}}\, b_m $$

When L is larger than 1+lg(k) a similar result holds, using
equation 2.4.


In the remainder of this section we consider one 
possible improvement: replacing the merging tasks with 
parallel merges. A two task merge of two files is possible 
by letting each task be an instance of the usual sequential 
two-way merge (see [4]), except that in one task merging 
begins with the two smallest items of the two files (a 
merge from the left), and in the other task merging begins 
with the two largest items (a merge from the right). In 
addition the two tasks are interlinked as follows: in 
sequential two-way merge, the pointers to the files are 
compared to the ends of the files; in a two task merge, 
the pointers of one task are compared to the pointers of 
the other task. Because of this, the two tasks finish 
together almost exactly, providing one has not already 
finished before the other starts. We now assume a
sequential two-way merge of two files each of size N
takes time 2a_m N. Hence a two process merge using the
above method would take time a_m N.
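
The two task merge can be illustrated with a sequential Python sketch that alternates one step of the merge from the left with one step of the merge from the right, so that each step sees the pointers of the other task. This interleaving stands in for true concurrency and is an assumption of the example, as is the function name `two_task_merge`; no synchronization issues are modelled.

```python
def two_task_merge(x, y):
    """Merge sorted lists x and y by working inward from both ends.

    The 'left' task emits the smallest remaining item, the 'right' task
    the largest; each compares its pointers with the other task's
    pointers rather than with the ends of the files.
    """
    out = [None] * (len(x) + len(y))
    i, j = 0, 0                    # left-task pointers into x and y
    p, q = len(x) - 1, len(y) - 1  # right-task pointers into x and y
    lo, hi = 0, len(out) - 1       # next output slots for each task
    while lo <= hi:
        # One step of the merge from the left.
        if j > q or (i <= p and x[i] <= y[j]):
            out[lo] = x[i]
            i += 1
        else:
            out[lo] = y[j]
            j += 1
        lo += 1
        if lo > hi:
            break
        # One step of the merge from the right.
        if q < j or (p >= i and x[p] > y[q]):
            out[hi] = x[p]
            p -= 1
        else:
            out[hi] = y[q]
            q -= 1
        hi -= 1
    return out
```

Because the two tasks alternate, each places about half of the items, which mirrors the halving of merge time assumed in the text.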


Next consider the following merging algorithm, for
k=4:


Figure 5.2


Assume the elements to be merged are x_1<x_2<x_3<...<x_N
and y_1<y_2<y_3<...<y_N. The tasks are:


I_1: Insert x_[N/2] into the y_i's.

I_2: Insert y_[N/2] into the x_i's.

Z: The results of the insertions determine
three pairs of subfiles, as shown below. Z
determines the subfile pairs and initializes
the L_i and R_i tasks.

L_i: Merge from the left of the ith subfile
pair.

R_i: Merge from the right of the ith subfile
pair.

Figure 5.3


If process 1 executes L_1 and process 2 executes
L_2 and then R_1, process 1 finishes before or with process
2. Let the sizes of the subfiles in the second subfile pair
be X and Y. The execution time for process 2, starting at
the completion of Z, is:


noe OE afer AZo 
2 2 


since (X+Y)/2sN/2. The same result holds for the process 
executing Ro and Lg. In order to find the distribution of 
|X-Y], it is assumed all elements x;, y; are distinct, and that 
all permutations are equally likely. Then the probability of 
inserting x 1 in position i is: 


(n) 


‘ + aN -1 


N(2-a) - , 
oN - 1 


P (yi <x un<Yj 41) = (1-a)N 


il 


mT) Cn) 


(j40N) ( 2N 
i+oN 


: 2oN e~ (i-oN) 4/N 
CitoN Nr 
using the normal approximation to the binomial 
distribution. This distribution is again approximately 


normal, with mean aN and standard deviation N/2. 
Assuming X and Y are actually distributed normally, the 
mean of |X-Y| can be calculated to be √(2N/π). Hence,


on (he Bh ova) 


k 
where the O(lg(N)) term is from the insertion tasks. 


E (tp) s 


Other merging algorithms for k=4 and for higher k
can be devised by using various element insertion
strategies. Similar techniques may be used in their
analysis.


6 - Dynamic Quicksort


We may use the dynamic algorithm of section 3 for
sorting, where tasks are considered to be subfiles, the
decomposition of a task is a partition of the subfile into
two subfiles, and the execution of a task is a sort of the
subfile. In analyzing this algorithm we make the following
assumptions, where the file to be sorted contains N
records:


(1) If M is the total number of subfiles to be
produced during the decomposition stage, the
total number of task-queue accesses is 3M-2,
and each process makes an approximate
average of 3M/k such accesses. We
therefore assume the overhead due to
process communication is linear in M, and is
given by w(k)M.

(2) When there are i subfiles, the mean
subfile size is N/i. It is assumed the time
needed to partition a subfile is exponentially
distributed, and that when there is a total of
i subfiles the mean time is aN/i.

(3) During the task execution phase, the
average subfile size is N/M. It is assumed
the time to sort one of the M subfiles
produced by decompositioning is
exponentially distributed, with mean
b(N/M)ln(N/M).


From equation 3.1, the mean execution time T(M,N,k) is:

$$ T(M,N,k) = w(k)M + b\,\frac{N}{M} \ln\Big(\frac{N}{M}\Big) \Big( \frac{M}{k} + H_k - 1 \Big) + \sum_{1 \le i \le k-1} \frac{aN}{i^2} + \sum_{k \le i \le M-1} \frac{aN}{ik} $$

$$ = w(k)M + \frac{N}{k} \big( b \ln N - aH_{k-1} + aH_{M-1} - b \ln M \big) + b\,\frac{N}{M} \ln\Big(\frac{N}{M}\Big) (H_k - 1) + aN H^{(2)}_{k-1} $$


Given N and k, we seek to find M so as to minimize
T(M,N,k). If we approximate H_{M-1} by ln(M), then M must
satisfy

$$ \frac{\partial T}{\partial M} = w(k) + \frac{N(a-b)}{kM} + bN(H_k - 1)\, \frac{\ln(M/N) - 1}{M^2} = 0. $$

Let

$$ A = \frac{w(k)}{bN(H_k - 1)} \quad \text{and} \quad B = \frac{a-b}{bk(H_k - 1)}; $$

then the optimal value of M is the solution of

$$ M e^{(AM^2 + BM - 1)} = N. $$


A short table of the optimal integer value of M for
various values of w(k)/b follows, for the case k=4, a=b,
N=10^6:


w(4)/b    M
10        938
10^2      313
10^3      105
10^4      35
10^5      11

Thus, given a,b,N, and k, the optimal degree of
decomposition is determined by w(k), the process
communication overhead.
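
The optimality condition can be solved numerically. Taking logarithms, it reads A M^2 + B M + ln(M/N) - 1 = 0, with A = w(k)/(bN(H_k - 1)) and B = (a-b)/(bk(H_k - 1)); this form is a reading of the condition above and should be treated as an assumption of the sketch. The code uses plain bisection and assumes a >= b, so that the left-hand side increases with M and changes sign between M=1 and M=N. The function name `optimal_m` is hypothetical.

```python
import math


def harmonic(n):
    """H_n = 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / i for i in range(1, n + 1))


def optimal_m(w, a, b, N, k):
    """Real-valued root M of A*M**2 + B*M + ln(M/N) - 1 = 0.

    w: communication overhead coefficient w(k); a, b: partition and sort
    cost coefficients; N: file size; k: number of processes.
    Assumes a >= b and that the root lies in [1, N].
    """
    h = harmonic(k) - 1.0
    A = w / (b * N * h)
    B = (a - b) / (b * k * h)

    def g(M):
        return A * M * M + B * M + math.log(M / N) - 1.0

    lo, hi = 1.0, float(N)
    for _ in range(200):  # bisection on an increasing function
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)
```

For k=4, a=b and N=10^6 the real-valued root decreases as w(4)/b grows, which is the trend shown in the table; rounding the root to an integer is left to the caller.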


7 - Summary


We have classified asynchronous multiprocessor
algorithms which employ problem decomposition as static
and dynamic. Static decomposition algorithms require
little process communication and would be well-suited for
systems where process communication is expensive, e.g.,
"loosely-coupled" computer networks.


A static decomposition algorithm is described by a
task-graph. Simple task-graphs have the property that
there is a simple expression for the probability
distribution of total execution time in terms of the
probability distributions of each task, providing the result
of one task does not affect the execution time of another.
If the probability distributions of each task's execution
time are unknown, it is still possible to bound mean total
execution times providing the means and variances of task
execution times are known.


Regarding the upper bound given by equation 2.3,
the bound is tight in that task-graphs and task execution
time probability distributions may be constructed so that
equality holds, using distributions derived in [2]. Any
improved bound would require either more detailed
information about the partial ordering of the tasks in the
expression of the bound, or additional assumptions about
the probability distributions of task execution times.


When process communication is inexpensive,
dynamic decomposition algorithms are suitable. One
technique for analyzing these algorithms is by means of a
queueing model. Queueing models may be used in
analyzing other types of asynchronous parallel algorithms
as well (e.g., in [1] a queueing model is used to analyze
asynchronous iterative methods).


For some static decomposition algorithms the 
bounds derived in section 2 may be directly applied, such 
as static quicksort with median finding and static merge 
sort. In other cases where task execution times are 
dependent other techniques must be used. This is the 
case for static quicksort when median finding is not used 
and in the parallel merging algorithm presented. These 
algorithms have dependent task execution times since 
there are tasks where the input size depends on the 


result of a previous task. 
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The that communication 
overhead is negligible in static decomposition algorithms 


is valid only if the total number of tasks is not very large. 


assumption process 


For this reason we have given bounds on mean execution 
time only for those algorithms in which the width of the 
task-graph is k (although a technique for greater width 
task-graphs has also been presented). These bounds 
give an indication of the performance that can be 
expected when process communication overhead is high 
enough to warrant the use of static decomposition.
However, in dynamic decomposition algorithms we may
choose the degree of decomposition, which should ideally
be chosen so as to balance process communication
overhead and adaptability to variations in the execution
times of tasks. For example, by applying a queueing
model to a dynamic generalization of quicksort, we have
derived an expression relating process communication
overhead and the optimal degree of decomposition.

ON THE PERFORMANCE AND COST-EFFECTIVENESS
OF SOME MULTIPROCESSOR SYSTEMS 


Terry T. Hsu 
Digital Image Systems Division 
Control Data Corporation 
Minneapolis, Minnesota 55440 


Summary 


The performance (maximum throughput) and cost 
are analyzed for multiprocessor systems using the 
global bus, shared memory, full interconnection, 
and ring configurations for inter-processor commu- 
nications [1] - [3]. The approach is to identify 
the major characteristic parameters and express 
the throughput and cost for all configurations in 
similar forms for easy comparisons.


Let the throughput of a single processor ele- 
ment (PE) be defined as Tp = 1/tp, where tp is the 
average processing time of an instruction. In the 
absence of inter-PE communication, any multi- 
processor system of N PE's has a maximum system 
throughput (T) of NTp. This bound is processor-
limited. However, assuming each PE has to do a
fraction (q) of its processing in communicating 
with some other PE's through the bus, the shared 
memory, or paths between PE's, as the case may be, 
then the system throughput may be bounded by the 
transfer rate or bandwidth of the communication 
path. This bound is said to be bandwidth-limited. 
By considering the utilization of the path by each 
PE, the bandwidth-limited upper bounds on system
throughputs can be shown as follows: 


                   Bound On       Utilization
System             Throughput     Factor
Global Bus         Tp/b           b = q tb/tp
Shared Memory      Tp/m           m = q tm/(tp M)
Full Connection    NTp
Ring               NTp/r          r = q d tr/tp


where tb, tm, and tr are the cycle times to
communicate an instruction or data between PE's
(adjacent in the case of a ring system) in the above
systems, respectively. M is the size of the common
memory, and d is the distance (number of links)
between the communicating PE's.


The bandwidth-limited system throughput be- 
comes minimum when a PE has to wait for all other 
(N-1) PE's to complete their use of the path first. 
This leads to the expression for a lower bound for 


all cases. NT 
= Seed NED s . 
Tdmin) = TWD u 


Where u is the utilization factor for each 
case (b, m, and r above) and is q(N-2)/(N-1)^2 for
the fully interconnected system. 


Of course the achievable maximum system 
throughput is determined by the smallest of all 
above bounds - processor or bandwidth limited. 
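
As a rough illustration of how these bounds combine, the Python sketch below evaluates the four upper bounds from the table and takes the smaller of the processor-limited and bandwidth-limited values for each configuration. The function names and the sample parameter values are assumptions made for this example only; the lower bound is not modeled.

```python
def bandwidth_bounds(n, tp_rate, b, m, r):
    """Upper bounds on system throughput from the table above.

    n       -- number of PE's
    tp_rate -- single-PE throughput Tp = 1/tp
    b, m, r -- utilization factors for bus, shared memory, and ring
    """
    return {
        "global_bus": tp_rate / b,
        "shared_memory": tp_rate / m,
        "full_connection": n * tp_rate,
        "ring": n * tp_rate / r,
    }


def achievable_max(n, tp_rate, b, m, r):
    """Smallest of the processor-limited and bandwidth-limited bounds."""
    processor_limit = n * tp_rate
    bounds = bandwidth_bounds(n, tp_rate, b, m, r)
    return {name: min(processor_limit, bw) for name, bw in bounds.items()}


if __name__ == "__main__":
    # Hypothetical values: 16 PE's, Tp normalized to 1, b = m = r = 0.1.
    for name, value in achievable_max(16, 1.0, 0.1, 0.1, 0.1).items():
        print(f"{name:16s} {value:8.2f}")
```

With these sample values the bus and shared-memory systems are bandwidth-limited, while the fully connected and ring systems remain processor-limited.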


The system cost is assumed to be composed of
five major component costs: PE (including local
memory), cable, I/O port, switching unit, and
extra memory. Let the unit costs be normalized to
the cost of one PE (Cpe) and denoted as Kc, Kp,
Ks, and Km, respectively. The total system cost
can then be expressed in terms of Cpe, the normal-
ized cost coefficients and some coefficient
multipliers, which are as follows:


System             Kc        Kp        Ks        Km
Global Bus         N         N         N         -
Shared Memory      N         2N        N^2       1
Full Connection    N(N-1)    N(N-1)    N(N-1)    -
Ring               N         N         2N        -


For example, the cost of a shared memory 
system with N PE's is expressed as Cpe (N + NKc + 
2NKp + N^2 Ks + Km).
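
The shared-memory cost expression can be checked numerically. The following Python sketch is only an illustration of that one formula; the function name is hypothetical, and the sample coefficients are the ones quoted for Figure 1 (Kc = 0, Kp = .06, Ks = .02, Km = 1).

```python
def shared_memory_cost(n, kc, kp, ks, km, c_pe=1.0):
    """Cost of a shared-memory system with n PE's.

    Implements Cpe (N + N Kc + 2N Kp + N^2 Ks + Km), with all unit
    costs normalized to the cost of one PE (c_pe).
    """
    return c_pe * (n + n * kc + 2 * n * kp + n ** 2 * ks + km)


if __name__ == "__main__":
    for n in (2, 4, 8, 16):
        cost = shared_memory_cost(n, kc=0.0, kp=0.06, ks=0.02, km=1.0)
        print(f"N = {n:2d}  cost = {cost:6.2f} PE-equivalents")
```

The N^2 term in Ks dominates as N grows, which is why the switching-unit cost matters most for large shared-memory systems.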


With above analytic expressions for the four 
systems, a cost-performance ratio can be obtained 
readily. Figure 1 compares the four systems on
their best performance, assuming b = m = r = .1,
and Kc = 0, Kp = .06, Ks = .02, Km = 1. The pur-
pose of the diagram is more illustrative than 
conclusive. Since many parameters are involved, 
one should be cautious in interpreting the graphs. 

A PERFORMANCE STUDY OF DISTRIBUTED CONTROL LOOP NETWORK*


Ming T. Liu, Roberto Pardo, and Gojko Babic 


Summary


This paper presents some analytical results 
on the performance of three types of distributed- 
control loop networks, viz., the Newhall loop 
(single message of variable length), the Pierce 
loop (multiple messages of fixed length), and the 
DLCN loop (multiple messages of variable length). 
The primary goal of the paper is to show, through 
queueing analysis, that the DLCN loop has superior 
performance over the other two loops, viz., it has 
better channel utilization and shorter message 
delay. A secondary goal is to verify our previous 
simulation results which made such a claim. 


Introduction 


The loop network is becoming increasingly 
popular today for the design of distributed pro- 
cessing systems. A loop network is composed of a 
high-speed digital communication channel, arranged 
as a closed loop to which computers, terminals and 
other devices are attached through loop interfaces. 
Messages from a sender are multiplexed on to the 
loop by its interface, then travel around the loop 
from interface to interface until received by the 
interface of the addressed receiver. Thus the 
design of the loop interface and the transmission 
mechanism it incorporates are very important in 
the operation of a loop network. 

Three transmission mechanisms have been pro- 
posed for use in distributed-control loop networks. 
In the Newhall loop, a round-robin control passing 
token circulates around the loop and allows only 
one interface at a time to transmit onto the loop 
a single message of variable length. In the Pierce 
loop, communication space on the loop is divided 
into fixed-size time slots, and it is possible 
for more than one interface to transmit onto the 
loop multiple messages of fixed length at a time. 
In the DLCN loop, the use of a delay buffer in the 
interface allows simultaneous transmission of 
multiple messages of variable length. 

The superior performance of DLCN transmission 
mechanism has been verified by an extensive simu- 
lation study [1]. In this paper we present some 
analytical results on the performance of DLCN as 
compared with Newhall and Pierce loops. The main 
problem in making the comparison is that, in the 
analysis of each loop, different characteristics 
of data sources are assumed.


Analytical Results 


In this section formulas for channel utili-
zation and average message delay for the three
loops are given. Only symmetric loops (symmetric
traffic pattern and identical nodal characteristics)
are considered. A glossary of terms to be used in
the following is listed in Table 1.


* ... the paper is available from the authors.


C            Capacity of communication channel (bits/sec)
N            Number of nodes in the loop
\lambda      Arrival rate of messages from the data
             source (messages/sec)
1/\mu        Average message length (bits)
\sigma^2     Second moment of message length (bits^2)
1/\mu'       Average duration of active period of data
             source (seconds)
1/\lambda'   Average duration of idle period of data
             source (seconds)
b            Bit rate during active period (bits/sec)
u            Utilization of data source
B            Number of information bits per packet (bits)
T            Average message delay (including all waiting
             times and time for multiplexing message onto
             loop) (seconds)
W            Average message waiting time (= T - 1/\mu C)
U            Channel utilization
B            Number of bits in the address field (bits)


Table 1. Glossary of Terms 


DLCN Loop Analytical Results 


Formulas presented below for the DLCN loop are 
derived in a paper by Liu, Babic and Pardo [2]. 
Data sources are assumed to generate messages
instantaneously according to a Poisson process,
and the distribution of message lengths is
general.


1. Channel Utilization:

      U = \lambda N / (2 \mu C)                                        (1)

2. Average Message Delay:

      T = T_1 + 1/(\mu C) + N B / (2C) + (N/2 - 1) T_2                 (2)

where

      T_1 = N \lambda \sigma^2 \mu / [4C (C\mu - \lambda)]             (3)

      T_2 = N \lambda \sigma^2 \mu^2
            / [2 (C\mu - \lambda)(2C\mu - N\lambda)]                   (4)
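
A small Python sketch may help in evaluating Equations (1) through (4). It assumes the equations read U = λN/2μC, T = T1 + 1/μC + NB/2C + (N/2 - 1)T2, T1 = Nλσ²μ/4C(Cμ - λ), and T2 = Nλσ²μ²/2(Cμ - λ)(2Cμ - Nλ), with σ² the second moment of message length and B the number of address-field bits. The function name, the stability check, and the sample values are assumptions introduced for this example.

```python
def dlcn_metrics(lam, n, mu, c, sigma2, b_addr):
    """Channel utilization U and average message delay T for a DLCN loop.

    lam    -- message arrival rate per node (messages/sec)
    n      -- number of nodes in the loop
    mu     -- reciprocal of average message length (1/bits)
    c      -- channel capacity (bits/sec)
    sigma2 -- second moment of message length (bits^2)
    b_addr -- number of bits in the address field
    """
    # Assumed validity condition: both denominators must stay positive.
    if c * mu <= lam or 2 * c * mu <= n * lam:
        raise ValueError("offered load exceeds channel capacity")
    u = lam * n / (2 * mu * c)                                    # Eq. (1)
    t1 = n * lam * sigma2 * mu / (4 * c * (c * mu - lam))         # Eq. (3)
    t2 = (n * lam * sigma2 * mu ** 2
          / (2 * (c * mu - lam) * (2 * c * mu - n * lam)))        # Eq. (4)
    t = t1 + 1 / (mu * c) + n * b_addr / (2 * c) + (n / 2 - 1) * t2  # Eq. (2)
    return u, t


if __name__ == "__main__":
    # Hypothetical loop: 10 nodes, 1 Mbit/sec channel, 1000-bit messages.
    for lam in (0.0, 50.0, 100.0, 150.0):
        u, t = dlcn_metrics(lam, n=10, mu=1e-3, c=1e6,
                            sigma2=2e6, b_addr=16)
        print(f"lambda = {lam:6.1f}  U = {u:.3f}  T = {t * 1e3:.3f} msec")
```

At zero load the delay reduces to the transmission time 1/μC plus the address-field delay NB/2C, which is a quick sanity check on the reading of Equation (2).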


Pierce Loop Analytical Results 


Formulas presented below for the Pierce loop 
are derived in a paper by Hayes and Sherman [3]. 
In their model data sources are characterized by
active and idle periods, both of which are assumed
to be exponentially distributed. During an active
period a data source produces bits at a constant rate
and generates one message.


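1. Channel Utilization:

      U = r N B / (2C)                                                 (5)

where

      r = \mu' Q u                                                     (6)
      Q = (1 - \exp(-B\mu'/b))^{-1}                                    (7)
      u = \lambda' / (\lambda' + \mu')                                 (8)

2. Average Message Delay:

      T = \gamma/\mu' + \gamma\sigma^*/\mu'
          + \gamma^2 \theta (1 + \sigma^*)^2
            / [\mu' (1 - \gamma\theta(1 + \sigma^*))]
          + U^* / [M^* (1 - \gamma\theta(1 + \sigma^*))]               (9)

where

      \gamma   = \mu' Q B / C                                          (10)
      \theta   = \lambda' / \mu'                                       (11)
      \sigma^* = R^* B / (C - R^* B)                                   (12)
      R^*      = r (N/2 - 1)                                           (13)
      U^*      = R^* B / C                                             (14)
      M^*      = \mu' / [\gamma (1 + \sigma^*)]                        (15)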


Newhall Loop Analytical Results 


Formulas presented below for the Newhall loop
are derived in a paper by Kaye [4]. Data sources
have the following characteristics: After a data
source generates one message it will be inactive
until that message is multiplexed onto the loop.
Afterward the data source behaves as a Poisson
process with parameter \lambda, but only until the next
message is produced. Then the data source is again
inactive, and so on. This process generates mes-
sages with effective average interarrival time
1/\lambda_{eff}, where \lambda_{eff} = \lambda (1 - P_L)
(P_L is given below).

1. Channel Utilization:

      U = N (1 - P_L) \lambda / (2 \mu C)                              (16)

where P_L is the portion of lost messages given by

      P_L = \lambda T / (1 + \lambda T)                                (17)

2. Average Message Delay:
T = l/c - 1/d
one AT.
+ (1/n) = t_e n(N - n)p (18)
n | n
n=0

where

      \bar{n} = \sum_{n=0}^{N} n p_n                                   (19)

= N/C + n/uc ~ - Pas (20)
| n.™1 aN/c_ag/uc |
p= KO) me NCMMHE = 1) (21)
| 5 a
for n = 1, 2, ..., N, and p_0 = K for n = 0, where
K is determined by \sum_{n=0}^{N} p_n = 1.


Performance Comparison 


Because of different characteristics of data 
source assumed in the analysis of each loop, we 
have decided to compare the DLCN loop against the 
Pierce loop and the Newhall loop separately. Fig- 
ures 1 and 2 show average message delay of the 
DLCN loop against the Pierce and the Newhall loops, 
respectively, and clearly verify that the DLCN 
loop has shorter message delay. Better channel 
utilization can also be easily verified by com- 
paring Equations (1), (5) and (16). 

PERFORMANCE ASPECTS OF MULTIPROCESSING IN A TIME SHARING ENVIRONMENT


Major Thomas C. Darr 
Rome Air Development Center 
Griffiss AFB 
Rome, NY, 13441 


Summary


One of the goals of the computer architecture 
program at RADC is to investigate the cost/perfor- 
mance benefits and trade-offs in the application 
of current and future technology to Air Force and 
DOD information processing systems. It has been 
recognized [1] that software complexity is respon-
sible for a major percentage of the life cycle
cost of most systems. It is also recognized that 
the availability of very low cost microprocessors 
and other LSI devices presents the potential for 
the design of highly reliable, low hardware cost 
systems for a number of applications. Opportuni- 
ties also present themselves for the reduction of 
software complexity by judicious use of such low 
cost hardware and firmware. 


This paper investigates the performance of a 
modern demand paged, virtual memory, multiprogram- 
med time sharing system (the MULTICS system at 
RADC). The average system response time for 
interactive users (editing, debugging, small 
compilations) is the basic performance yardstick 
for such systems. It is shown that the system 
response time is governed and limited by the 
memory management mechanism. A queueing network 
model of the system is used to aid the analysis 


[2]. 


A model of a multiprocessing system is ana- 
lyzed [3] with particular attention to comparison 
of system response time, cost, complexity, and 
reliability with that of the single processor
system. It is shown that performance is still 
limited by the memory management mechanisms. 


It is conjectured that, within the capabili-
ties of forecasted technology in LSI, microproces-
sors, and computer communications, cost/effective 
organizations of such systems will eliminate the 
interactive processing function from the central- 
ized resources. These functions will be placed 
in highly intelligent terminals, leaving the 
secondary storage media with user and system data 
and programs, and a sophisticated data base proces-
sor, at the central location. It is shown how
this organization simplifies required operating 
system software, allows for nearly unlimited 
growth in the user population, and increases sys- 
tem reliability. Other benefits, such as ease in 
processing secure data, are discussed. 

STARAN SERIES E


Abstract. STARAN(a) Series E is an enhanced
version of the STARAN parallel array processor.
Multi-dimensional-access data storage and high
speed control storage have been increased several-
fold. Faster IC's and new processing algorithms
improve the processing rates significantly. A
new I/O unit allows faster and more flexible
input and output of data. The cost impact of
these changes has been minimized by using stand-
ard parts which were not available when the first
STARAN was built. In this paper we discuss the
enhancements and the reasons they were incorpor-
ated.


Introduction 


As explained in [1] all logic and memory 
devices in STARAN are standard high-volume inte- 
grated circuits. The semi-conductor industry is 
continually improving these circuits, e.g., more 
bits on a chip and higher speeds. Thus, it is 
relatively easy to enhance STARAN by using the 
newer circuits as they become available. The 
goal in the design of the Series E systems was to 
enhance the capabilities of STARAN using newer 
circuits and better processing algorithms while 
preserving software compatibility with the 
original design. 


The major enhancement in the Series E 
systems is a much larger multi-dimensional-access 
(MDA) memory in each array module [2]. Other 
enhancements are a larger high-speed control 
store, faster processing rates and a new I/O unit. 


Larger MDA Memory 


Need for Larger Memory 


The MDA memory in the original design used 
256-bit random-access-memory (RAM) devices. 
These were the largest bipolar RAM devices that 
were generally available at the time (1971-1972). 
The size of these devices governed the choice of 
256-bit words with a simple PE per word. 


After several applications were programmed 
on the system it became evident that larger words 
would be better. In most applications the size 
of the machine (number of array modules) is 
dictated by memory requirements instead of 
processing speed requirements. In some applica- 
tions some data is off-loaded into the control 
store to make room in the MDA memories. In some 
other applications two or more words are used per 
item and some of the PE's are wasted. 


Another indication that a larger MDA memory
is desirable appears when the ratio of storage
bits to processing speed (MIPS) is examined for
several large computer systems. STARAN has a low
ratio compared to other systems (either we have
too many MIPS or too few bits). Increasing the
storage by a factor of 20 would bring us in line
with other systems.


(a) T.M. Goodyear Aerospace Corporation,
Akron, Ohio 44315


MDA Memory in Series E


In the original STARAN design the number of
bits per word was fixed at 256. Each array
module contained 8K bytes in a 256 x 256 bit array
and 256 PE's. To increase the storage capacity
of a machine one added array modules and added
more PE's as well. In the Series E machines one
can size the processing power (number of PE's)
and memory capacity independently. This is
accomplished by multiplying the MDA memory address
space by a large factor (256) and allowing the
space to be partially populated.


Bipolar 1,024-bit RAM devices are generally
available today. These devices have ECL-compat-
ible interfaces so that they match the logic
levels of STARAN better than the TTL-compatible
devices of the original design and have faster
access and cycle times. Thus, these devices are
a natural choice for Series E.


MOS 4,096-bit RAM's are also generally 
available. To keep the slow MOS speed from 
severely affecting the processing rate of the 
machine, part of the storage should be bipolar 
and algorithms modified to move most of the 
memory accesses into the high-speed bipolar 
storage. The section on faster processing rates 
discusses these modifications. 


A mixture of high-speed bipolar and low-
speed MOS devices is allowed in the MDA storage
of a Series E array module (Figure 1). The MOS
MDA memory is an array of 256 by mK bits where K =
1024 and m is a multiple of 4. It uses 4,096-bit
MOS RAM's. The bandwidth of the MOS memory is
256 bits in or out every 420 nanoseconds (76
megabytes/second). The bipolar MDA memory is an
array of 256 by nK bits where n is an integer.

It uses 1,024-bit bipolar RAM's. A read cycle in 
the bipolar memory requires 120 nanoseconds (267 
megabytes/second) and a write cycle takes 160 
nanoseconds (200 megabytes/second). For compari- 
son, the MDA memory of the original STARAN had a 
read cycle of 120 nanoseconds and a write cycle 
of 300 nanoseconds. 


The maximum MDA storage in each array module, 
limited by the address space, is 256 x 64K bits 
(2,097,152 bytes). The physical size of the array 
module grows with storage capacity and depends on 


the mix of bipolar and MOS storage. In the first 
Series E machine, m=8 and n=1, for a capacity of
262,144 MOS bytes and 32,768 bipolar bytes per 
array module. Two array modules are packaged in 
one STARAN cabinet (array modules with greater 
storage capacity are packaged one per cabinet). 
For a cost increase of less than 50% these array 
modules have 36 times the storage of the original 
STARAN modules. 
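
The capacity and bandwidth figures quoted above follow from simple arithmetic on the array dimensions. The Python sketch below reproduces them; the function names are illustrative assumptions and not part of any STARAN software.

```python
WORDS = 256          # bits accessed per memory cycle (one per PE)
K = 1024             # bit-slices per memory increment


def mda_bytes(k_multiple):
    """Bytes in a 256 by (k_multiple * K) bit MDA array."""
    return WORDS * k_multiple * K // 8


def bandwidth_mbytes_per_sec(cycle_ns):
    """Megabytes/second when 256 bits move every cycle_ns nanoseconds."""
    return (WORDS / 8) / cycle_ns * 1000.0


if __name__ == "__main__":
    mos = mda_bytes(8)             # m = 8 in the first Series E machine
    bipolar = mda_bytes(1)         # n = 1
    original = 256 * 256 // 8      # original 256 x 256 bit array
    print("MOS bytes:      ", mos)
    print("bipolar bytes:  ", bipolar)
    print("maximum bytes:  ", mda_bytes(64))
    print("storage ratio:  ", (mos + bipolar) // original)
    for label, ns in (("MOS", 420), ("bipolar read", 120),
                      ("bipolar write", 160)):
        print(f"{label:14s} {bandwidth_mbytes_per_sec(ns):6.1f} MB/sec")
```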


Accessing MDA Memory 


In the original design each MDA memory access 
(fetch or store) required two parameters: an 
8-bit access mode to select one of 256 stencil 
shapes and an 8-bit address to position the 
selected stencil at one of 256 positions [2]. 


The larger MDA address space in Series E 
requires a 16-bit address. Some thought was 
given to also increasing the access mode para- 
meter and allowing some stencils to access every 
ith bit slice of a word (i=2,4,8,etc.). This idea 
was rejected because such stencils do not appear 
to be generally useful and they are hard to imple- 
ment in a memory which is partially populated with 
both fast and slow memory devices. Because of the 
way data is scrambled in memory it is important 
that all 256 memory bits accessed at one time act- 
ually exist and either be all "fast" bits or all 
"slow" bits. The basic memory increment is 1,024 
bit-slices so the maximum allowable access mode 
parameter is 10 bits--the increase from 8 bits to 
10 bits does not add any significant MDA capabil- 


ity so the access mode parameter was left at 8 bits. 


In Series E, one may view the MDA memory of 
an array module as a number of 256 x 256-bit planes 
(Figure 2). The leftmost 8 bits of the 16-bit 
address select one of the planes. The 8-bit access 
mode selects a stencil shape and the rightmost 
8 bits of the address positions the stencil within 
the selected plane. All 256 bits covered by the 
stencil are fetched or stored in one memory cycle. 
The shapes of the 256 possible stencils are 
discussed in [2].


Bipolar MDA memory occupies the first 4n 
planes (n=1,2,3,...) and MOS MDA memory the next 
4m planes (m=0,4,8,12,...). 
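
The plane-oriented view of the address can be sketched as follows. This is a minimal Python illustration, assuming that planes are numbered from zero, that bipolar planes come first, and that any plane beyond the populated range is simply absent; the function names are hypothetical.

```python
def decode_address(address16):
    """Split a 16-bit MDA address into (plane, position within plane)."""
    plane = (address16 >> 8) & 0xFF      # leftmost 8 bits select the plane
    position = address16 & 0xFF          # rightmost 8 bits place the stencil
    return plane, position


def plane_kind(plane, n, m):
    """Classify a plane for a module with 4n bipolar and 4m MOS planes."""
    if plane < 4 * n:
        return "bipolar"
    if plane < 4 * n + 4 * m:
        return "mos"
    return "unpopulated"


if __name__ == "__main__":
    # First Series E machine: n = 1 (4 bipolar planes), m = 8 (32 MOS planes).
    for addr in (0x0012, 0x0412, 0x23FF, 0x2400):
        plane, pos = decode_address(addr)
        print(f"{addr:#06x}: plane {plane:3d}, position {pos:3d}, "
              f"{plane_kind(plane, 1, 8)}")
```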


Base Registers 


In the original design the 8-bit MDA memory 
address came from one of five sources: an address 
field in the instruction, the resolver through the 
link pointer or one of three field pointers. The 
instruction address field is usually used to ref- 
erence flag bits in fixed locations. The resolver 
and link pointer are used to reference particular 
words in the MDA memory, e.g., a word satisfying 
an associative search operation. The three field 
pointers are used to step through the bit-slices 
of fields in arithmetic and search operations; e.g.,
to add field A to field B with the result put in 
field C the three field pointers reference corres- 
ponding bit-slices of fields A, B and C. 


In the original design the 8-bit MDA access
mode came from one of two access mode registers
(AMR0 and AMR1) depending on the state of a mode
bit in the instructions referencing MDA memory.


In Series E, five base registers are included 
in the control unit; one for each of the five 
sources of MDA memory addresses. The final 16-bit 
MDA memory address is formed by adding the 8-bit 
source to a 16-bit base address in the associated 
base register. The 8-bit access mode comes from 
a field in the base register. This arrangement 
allows addressing of: 1. flags in a flag-bit 
region, 2. words satisfying a search, and 3. fields 
in scattered memory regions without modifying the 
base registers. 


The speed of arithmetic instructions with 
long sequences of micro-steps (multiply, divide, 
square root and floating-point) would be severely 
affected if all micro-steps addressed fields in 
the MOS MDA memory. To speed up these instruc- 
tions, a sixth base register is used as a pointer 
to the base of a vector stack in the fast bipolar 
MDA memory. The long arithmetic instructions 
move vector operands from MOS memory to the bi- 
polar stack, operate on the stacked vectors and 
then return the results to the MOS memory. Guard 
bits are added to the vectors when moved onto the 
stack and rounded-off when results are unstacked. 


The mode bit of the instruction (used in 
the original design to select AMR0 or AMR1) is
used in Series E to select vector stack address- 
ing or the five-base-register addressing. In 
vector stack addressing, the 16-bit stack base 
address in the sixth base register is added to the 
8-bit source regardless of its source so system 
micro-programs operating on stacked vectors can 
use all address sources. 
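
The address formation just described can be summarized in a short Python sketch. The register names, the dictionary layout, and the wrap-around at 16 bits are assumptions made for illustration; the text specifies only that an 8-bit source is added to a 16-bit base, and that the mode bit selects the vector stack base instead of the per-source base register.

```python
# Hypothetical names for the five MDA address sources.
SOURCES = ("instruction", "link", "field_a", "field_b", "field_c")


def effective_address(source_name, source8, base_regs, stack_base,
                      vector_stack_mode):
    """Form a 16-bit MDA address from an 8-bit source and a base register.

    base_regs         -- dict mapping each source name to a 16-bit base
    stack_base        -- 16-bit base of the vector stack (sixth register)
    vector_stack_mode -- state of the instruction's mode bit
    """
    if source_name not in SOURCES:
        raise ValueError(f"unknown address source: {source_name}")
    base = stack_base if vector_stack_mode else base_regs[source_name]
    # Wrap-around at 16 bits is an assumption of this sketch.
    return (base + (source8 & 0xFF)) & 0xFFFF


if __name__ == "__main__":
    bases = {name: 0x0400 * (i + 1) for i, name in enumerate(SOURCES)}
    print(hex(effective_address("field_a", 0x10, bases, 0x0000, False)))
    print(hex(effective_address("field_a", 0x10, bases, 0x0000, True)))
```

In stack mode every source resolves relative to the same base, which is what lets micro-programs operating on stacked vectors use all address sources.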


A set of sixteen 32-bit registers is added to 
the control unit in Series E. Six of the regis- 
ters are the MDA base registers just discussed. 
Another eight registers are the return-jump 
registers (RO - R7) of the original design. The 
other two registers can be used as general-purpose 
registers. The Series E instruction set is 
augmented with instructions to manipulate these 
registers. 


Control Memory 


The control memory of the original design had 
an address space of 65,536 32-bit words populated 
with three high-speed 512-word page memories, one 
high-speed 512-word data buffer and a 16,384-word 
magnetic core memory. The remainder of the address 
space could be used to address the memory of a 
host computer if such an interface exists or to 
double the capacity of the pages, the high-speed 
data buffer and/or the core memory. 


The page memories hold the micro-program 
instructions of the system subroutines and user- 
generated micro code. For some applications page 
memory space was tight and measures such as execu- 
ting some micro-code in core memory or swapping 


micro-code in the pages were necessary. With the 
larger memory devices available now it is easy to 
expand the page memory capacity without increasing 
their physical size. In Series E, each of the 
three page memories holds 4,096 words and can be 
doubled to 8,192 words if necessary. The memory 
devices are faster so 100-nanosecond instruction 
fetch rates can be supported (compared to 120 
nanoseconds in the original design). 


Magnetic-core memory space was also tight in 
some applications of the original design. In 
these applications a significant amount of core 
memory space was used to unload the small MDA 
memories and/or buffer data on the I/O channels. 
In Series E the MDA memories are much larger and 
the I/O channels communicate with the MDA memories 
directly so the core memory is relieved of this 
burden. The core memory capacity in Series E is 
the same as the original design (32,768 words). 


Faster Processing Rates 


There are about two MDA memory read steps, 
and one array register transfer for every MDA 
memory write step in the typical application 
program. The following table uses this ratio and 
the read and write cycle times of the original MDA 
memory and the MDA memories of Series E to show 
the effect of MDA memory times on the processing 
rate. 


                        Original    Series E    Series E
                        MDA         Bipolar     MOS
                        Memory      Memory      Memory
Array Register
Move Time (nsec)        120         100         100
Read Time (nsec)        120         120         420
Write Time (nsec)       300         160         420
1 Reg. Move + 2 Read
+ 1 Write (nsec)        660         500         1360
Relative Process-
ing Rate                1           1.32        0.49


With no changes in the system micro routines, the 
processing rate of Series E would be close to that 
of the original design. 
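
The totals and relative rates in the table can be reproduced with the stated mix of one register move, two reads, and one write. The Python sketch below is only a check of that arithmetic; the names are illustrative.

```python
def step_time(move_ns, read_ns, write_ns):
    """Time for 1 register move + 2 reads + 1 write, in nanoseconds."""
    return move_ns + 2 * read_ns + write_ns


TIMES = {
    "original": step_time(120, 120, 300),
    "series_e_bipolar": step_time(100, 120, 160),
    "series_e_mos": step_time(100, 420, 420),
}

if __name__ == "__main__":
    for name, ns in TIMES.items():
        rate = TIMES["original"] / ns
        print(f"{name:18s} {ns:5d} nsec  relative rate {rate:.2f}")
```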


The small MDA memory in the original design
limited the arithmetic micro-routines to little or
no temporary space for their calculations. This
had a severe impact on the execution times of the
multiply, divide, square root and floating-point
operations. With the much larger MDA memory of
Series E some of the memory space can now be given
to these operations for temporary storage. The
micro-routines for these operations were rewritten
to use the vector stack in bipolar MDA memory for
temporary storage. Some examples of the speed
improvement are shown in the following table.


                                    Series E speed/
Operation                           original speed
32-bit floating-point add           3.0
32-bit floating-point multiply      4.0
32-bit floating-point divide        2.0
16-bit fixed-point multiply         1.8
16-bit fixed-point divide           1.5


Since the floating-point micro-routines had to be 
recoded, it was decided to allow other precisions 
besides single- and double-length. The precision
of these operations can then be tailored to match 
the precision of an attached host computer or to 
match problem requirements. A maximum precision 
of 100 bits was selected -- this is large enough 
to cover most applications and small enough to be 
handled conveniently in the MDA memory vector 
stack. No special problems arise if the precision 
is two or more bits so a minimum precision of 

2 bits was selected. Users can adjust the 
precisions of floating-point operands anywhere in 
this large range. Operands with different 
precisions can be combined and results stored with 
another precision in the four basic floating-point 
operations: add, subtract, multiply and divide. 
The execution time and vector stack space used 
depends on the operand precisions. 


To accommodate host computers with different
exponent lengths, floating-point operands can have
a base-2 exponent with 7 to 11 bits. In the basic
operations all operand exponent lengths must agree.


The format of floating-point numbers in
Series E was selected to maximize performance. 
The format is one sign bit followed by 7 to 11 
base-2 exponent bits followed by 2 to 100 mantissa 
bits. Exponents are biased by 64, 128, 256, 512, 
or 1024 depending on exponent length. Non-zero 
numbers have normalized mantissas (the most- 
significant mantissa bit is always 1). All bits 
of a floating-point-zero are 0. 
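
The field layout can be illustrated with a short Python sketch. It covers only the layout and the bias rule stated above; the function names are hypothetical, and placing the sign bit in the most significant position of a Python integer is an assumption of this example, not a statement about the hardware.

```python
def exponent_bias(exp_bits):
    """Bias for a 7- to 11-bit exponent: 64, 128, 256, 512, or 1024."""
    if not 7 <= exp_bits <= 11:
        raise ValueError("exponent length must be 7 to 11 bits")
    return 1 << (exp_bits - 1)


def pack_fields(sign, biased_exp, mantissa, exp_bits, man_bits):
    """Pack sign, biased exponent, and mantissa into one integer."""
    if not 2 <= man_bits <= 100:
        raise ValueError("mantissa length must be 2 to 100 bits")
    exponent_bias(exp_bits)                      # validates exp_bits
    if sign not in (0, 1):
        raise ValueError("sign must be 0 or 1")
    if not 0 <= biased_exp < (1 << exp_bits):
        raise ValueError("exponent out of range")
    if not 0 <= mantissa < (1 << man_bits):
        raise ValueError("mantissa out of range")
    is_zero = sign == 0 and biased_exp == 0 and mantissa == 0
    # Non-zero numbers must have the most-significant mantissa bit set.
    if not is_zero and (mantissa >> (man_bits - 1)) != 1:
        raise ValueError("non-zero mantissa must be normalized")
    return (sign << (exp_bits + man_bits)) | (biased_exp << man_bits) | mantissa


if __name__ == "__main__":
    print([exponent_bias(e) for e in range(7, 12)])
    print(hex(pack_fields(1, exponent_bias(8), 1 << 23, 8, 24)))
```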


The format of fixed-point numbers is the same 
as in the original design--a two's-complement 
representation with three or more bits, and any 
scale factor. 


Input-Output 


The array modules of the original design had 
two ways of inputting and outputting data: 
through the 32-bit common register or through an 
optional parallel input-output (PIO) unit. The 
PIO unit had wide ports (256 bits) into each array 
module and allowed transfer of data at 80 mega- 
byte/second rates. Each array port had 1024 wires 
(256 twisted-pair inputs and 256 twisted-pair
outputs) so the PIO unit was relatively expensive.
In some applications the common register path was
too slow, inconvenient to use, or required large
buffer space in the control memory.


In Series E the array I/O was redesigned.
We found that data can be reliably transferred
over a 32-bit-wide path at 80 megabytes/second so
the high I/O rates supported by the original PIO
unit can be accomplished with busses only 1/8 as
wide. The 8-to-1 reduction of bus width through
the I/O unit reduces its cost dramatically.


Each array module has a multiplexer-demulti-
plexer (MPX/DEMPX) to pack and unpack data between
the 256-bit-wide internal busses and the 32-bit-
wide I/O busses (see Figure 1). A control unit
associated with the MPX/DEMPX steals an MDA memory
cycle to fetch or store I/O data -- both the
access mode and the address come from registers in
the MPX/DEMPX control unit.


The I/O busses of the array modules are
coupled to a cross-bar to permit data transfers
between array modules and I/O to external devices.


References


[1] J. D. Feldman and L. C. Fulmer, RADCAP - An
operational parallel processing facility,
AFIPS Conf. Proc. Vol. 43, pp. 7-15 (1974
Nat'l. Computer Conference).

[2] K. E. Batcher, The Multidimensional Access
Memory in STARAN, IEEE Trans. on Computers,
Vol. C-26, no. 2, pp. 174-177, Feb. 1977.

STARAN E PERFORMANCE AND LACIE ALGORITHMS


Roger L. Boulis and Rudolf O. Faiss
Application Engineering 
Goodyear Aerospace Corporation 
Akron, Ohio 44315 


Abstract. The features of the Goodyear
Aerospace Corporation's new STARAN(a) E model
computer are described and its architecture is
discussed. The implications of the new archi-
tecture on array storage, input/output (I/O),
arithmetic, program design, and software
generation capabilities are related to corre-
sponding features of the earlier STARAN B
model. Through examination of the use of the
STARAN E for performing LACIE tasks now done
by the STARAN model B machine, the utility of
the new features is shown.


Introduction 


The purpose of this paper is to introduce 
the reader to the improvements that can be 
afforded to a STARAN E model user over the ear- 
lier B model system. 


Throughout the paper, potential users of 
parallel processors are acquainted with the fea- 
tures of the Goodyear Aerospace Corporation 
STARAN B and E model computers. A critique of 
the STARAN B model as currently applied to the 
LACIE investigation [1] provides the basis upon 
which the STARAN E model establishes its merit. 


This paper is presented in three basic 
sections. The first section provides a short 
summary of the STARAN computer and its usage 
experience at the various STARAN installations. 
This is accompanied by a more detailed descrip- 
tion of the functions the STARAN B is required 
to perform on LACIE algorithms at Johnson Space 
Center (JSC). The second section describes the 
highlights of the STARAN E model, making refer- 
ence to the STARAN B model where applicable. 
The last section discusses the advantages the 
STARAN E model can demonstrate over the STARAN B 
model when the LACIE algorithms are adapted to 
the new STARAN E architecture. 


Background 


STARAN Architecture Summary 


The STARAN is a modularly constructed com- 
puter in which many identical operations may be 
executed simultaneously; that is, it is a 
"single instruction stream, multiple data stream" 
processor. 


The basic STARAN building block module is
called an array. It consists of an array memory
section, 256 bit-oriented processing elements,
and a routing network/address structure that
allows for multidimensional access of the array
memory by the processing elements. Local array
control allows selective enabling of data
streams. A single STARAN control unit broadcasts
instructions to all enabled array modules.


(a) TM, Goodyear Aerospace Corporation,
Akron, Ohio 44315

(b) Large Area Crop Inventory Experiment at
NASA's Johnson Space Center, Houston, Texas


STARAN's modular construction allows for the 
incremental increase of not only working storage, 
but also, memory-to-processing element bandwidth 
and processing elements. For example, in a one 
module STARAN, an "add" operation can be executed 
simultaneously for 256 pairs of numbers (or 256 
data streams). The parallel execution of an 
operation for many data pairs is made possible 
by employing many processing elements (256) per 
module. In a two module STARAN, twice as many 
adds can be performed in the same time interval 
because twice the resources are provided; 512 
data streams may be treated simultaneously. 
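
The scaling argument above can be sketched numerically. The following is a minimal illustration, assuming only the figure of 256 processing elements per module given in the text; the function name `passes_needed` is hypothetical and is not part of any STARAN software.

```python
import math

PES_PER_MODULE = 256  # processing elements per array module (from the text)


def passes_needed(n_pairs: int, modules: int) -> int:
    """Number of broadcast 'add' operations needed to treat n_pairs operand pairs.

    Each broadcast operation treats one pair per processing element, so a
    system with `modules` arrays treats 256 * modules pairs per operation.
    """
    if n_pairs < 0 or modules < 1:
        raise ValueError("n_pairs must be >= 0 and modules must be >= 1")
    streams = PES_PER_MODULE * modules
    return math.ceil(n_pairs / streams)


if __name__ == "__main__":
    for modules in (1, 2):
        print(modules, "module(s):", passes_needed(512, modules), "pass(es) for 512 pairs")
```

With one module, 512 pairs take two passes; with two modules, one pass suffices, which is the doubling described in the text.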


The high processing and throughput speeds of 
STARAN are a direct result of its parallel pro- 
cessing architecture. 


STARAN Usage Experience 


Since its introduction to the commercial 
marketplace in the spring of 1972, STARAN B has 
been used to solve a variety of applications 
problems. Independent of the applications, the 
tasks that have been implemented in STARAN for 
problem solving can be categorized into two 
major types, namely, bit and bit-group manipu- 
lation type tasks. 


The former type tasks generally require 
access to specific individual bits of both input 
and intermediate task data items. Bit manipu- 
lation capabilities are commonly employed for 
solving problems that require automatic deci- 
sions, e.g., problems that arise out of data base 
management, text searching, command and control, 
and air traffic control applications. They are 
also used for problems that are dominated by one 
bit data items. A number of problems that arise 
out of cartography, graphics/drafting, weapons 
sensor processing simulation, and attribute-to- 
boundary correlation applications are of this 
nature. 


The second task category, the bit-group 
manipulation type task, allows access to data 
items by bit-groups. An N-bit data item may be 
treated N bits at a time (as in the multipli- 
cation of two such items). Such tasks are 
generally of an arithmetic nature and most often 
are add or multiply tasks. Applications that 
employ bit-group processing are far ranging, and 
those applications that are most likely to de- 
mand that bit-group tasks be executed at high 
rates are those that deal with vectors or arrays 
of N-bit data items. Examples of applications 
that treat data of this type at high rates include 
image processing, signal processing, weather 
forecasting, reactor design, and fluid dynamics 
applications. 


STARAN's processing elements were designed 
for the manipulation of individual data bits of 
data items in a great number of data streams, 
i.e., it is a bit manipulator. At present, 
STARAN is optimized for performing bit manipu- 
lation tasks. 


Bit-group manipulation tasks are accomplished 
in STARAN by executing a regular sequence of 
operations on the individual bits of bit-group 
operands. As a result of the manner in which bit- 
group tasks are accomplished by STARAN, arbi- 
trarily sized N-bit group data items are treated 
with equal ease. Thus, a 5-bit x 51-bit multiply 
task is accomplished with no more programming 
difficulty than a 16 x 16-bit multiply task; both 
multiply tasks demand about the same processing 
time. 
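
The way a bit-group task reduces to a regular sequence of single-bit operations can be illustrated with a bit-serial addition. The sketch below is not STARAN code: it assumes a simple "bit-slice" layout in which slice i is a Python integer whose bit w holds bit i of word w, so that one logical operation on a slice touches every word at once. All function names are hypothetical.

```python
def to_bit_slices(words, nbits):
    """Transpose a list of non-negative integers into nbits bit-slices.

    Slice i is an integer whose bit w holds bit i of words[w].
    """
    slices = []
    for i in range(nbits):
        s = 0
        for w, value in enumerate(words):
            s |= ((value >> i) & 1) << w
        slices.append(s)
    return slices


def from_bit_slices(slices, n_words):
    """Inverse of to_bit_slices: rebuild the list of integers."""
    words = [0] * n_words
    for i, s in enumerate(slices):
        for w in range(n_words):
            words[w] |= ((s >> w) & 1) << i
    return words


def bit_serial_add(a_slices, b_slices):
    """Add two bit-sliced operand sets, one bit position per step.

    Each loop iteration is a full-adder step applied to all words in
    parallel; the field length is simply the number of iterations.
    """
    carry = 0
    out = []
    for a, b in zip(a_slices, b_slices):
        out.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a ^ b))
    out.append(carry)  # carry-out slice, so the sum is one bit longer
    return out


if __name__ == "__main__":
    a, b, nbits = [3, 5, 255, 0], [1, 7, 1, 0], 8
    total = bit_serial_add(to_bit_slices(a, nbits), to_bit_slices(b, nbits))
    print(from_bit_slices(total, len(a)))
```

Because the field length only sets the loop count, a 5-bit field and a 16-bit field are handled by the same code, which mirrors the "equal ease" remark above.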


Three STARAN B installations other than 
those at GAC are now in regular use: the Rome 
Air Development Center (RADC), the Engineering 
Topographic Laboratory (ETL), and the Johnson 
Space Center (JSC) installations. The RADC and 
ETL STARAN's are being used primarily for appli- 
cations that require STARAN to perform bit mani- 
pulation tasks, whereas the STARAN of the most 
recent installation at JSC is being used pri- 
marily for accomplishing bit-group manipulation 
tasks; more specifically, for vector arithmetic 
processing tasks. 


While the usage experience at all three 
installations has influenced the choice of 
STARAN enhancements implemented in GAC's upgraded 
STARAN E machine, the LACIE application task set 
executed by the present JSC STARAN B will be used 
as the primary vehicle to demonstrate the 
improvement of the new STARAN E over the older 
STARAN B machine. 


This task set has been chosen as a demon- 
stration vehicle deliberately for the following 
reasons: (1) the STARAN B installation is being 
used in a semi-production fashion, and so good 
installation usage statistics are available; 
(2) the LACIE application requires the repeated 
execution of bit-group tasks; and (3) the cost 
statistics for the development of the LACIE 
STARAN software are well known. 


Rationale for STARAN in LACIE 


Prior to the LACIE program, NASA had de- 
veloped an enhanced Earth Resources Interactive 
Processing System that was structured to utilize 
one of the five IBM 360/75 computers in the real 
time computer complex in the Mission Control 
Center of JSC. It was determined to be desirable 
to use this same software/hardware system for 
implementing the LACIE program. Yet, the com- 
plete implementation of the LACIE software set 
within the computers of the Mission Control 
Center would have severely drained the complex of 
its compute power. On the basis of a competi- 
tive assessment, NASA ultimately chose a two 
array STARAN B to off-load the computationally 
demanding LACIE data processing tasks [1]. In 
particular, STARAN B was required to perform: 
(1) Statistics, (2) Iterative Clustering, (3) 
Adaptive Clustering, (4) Maximum Likelihood 
Classification, and (5) Mixture Density tasks. 


This paper will show why the STARAN E would 
simplify the design and coding of software for 
achieving the above LACIE tasks. It will further 
show that STARAN E could have performed the tasks 
more rapidly and potentially could have reduced 
the number of arrays required for processing from 
2 to 1. It will show that STARAN E would have 
been a more desirable off-loading vector pro- 
cessor than the STARAN B. 


STARAN E Highlights 


General 


Since its introduction, the STARAN B model 
computer has satisfied and even exceeded system 
performance predictions for a variety of appli- 
cations. The extensive and varied use has pin- 
pointed certain STARAN B limitations. At times, 
users have found the STARAN B word length too 
short, the data transfers between array memory 
and program memory too slow, and the size of the 
page memories insufficient. To remove these 
limitations, the STARAN B model architecture was 
modified in three major areas: (1) multidimensional 
access (MDA) array memory size/speed, 
(2) control memory size/speed, and (3) MDA array 
memory I/O. 


Significant Hardware Features 


Figure 1 is a block diagram of the resulting 
STARAN E with the modified hardware outlined. 
The new hardware will be discussed below. 


MDA Array Memory Size/Speed. The heart of 
the STARAN E model is the MDA array memory, a 
two dimensional matrix of bits. There are three 
basic components of each array module in a STARAN 
system, whether it be an E or B model (see 
Figure 2). A set of processing elements (PE's) 
is connected through a permutation network to a 
high bandwidth MDA memory. Rows, columns, or 
other subsets of data can be read in parallel 
from the memory, permuted in various ways as they 
pass through the permutation network, and then 
can be combined with other data in the processing 
elements. The processed results can be again 
permuted and stored into memory in various ways. 
The arrays in the STARAN E model are still the 
basic modules from which STARAN E systems of 
varying size and power are constructed. The 
maximum number of STARAN E arrays in a given 
system is eight. The size of the STARAN E arrays 
has been increased from 256 bits per word, as in 
the STARAN B model, to an allowable maximum of 
65,536 bits per word. Each array still contains 
256 words. Any of the 256 PE's within an array 
still has access to data in any array in the 
system. A maximum STARAN E model configuration 
is shown in Figure 3. 


Two types of memory may be employed in each 
memory module: fast bipolar 1,024-bit random 
access memory (RAM) and slower metal-oxide 
semiconductor (MOS) 4,096-bit RAM. The former has 
emitter-coupled logic (ECL) electrical 
characteristics that provide a read time of 120 nsec 
and a write time of 160 nsec. The latter exhibits 
standard MOS electrical characteristics, yielding 
420 nsec read and write times. The allowed 
mix and amount of these two types of memory is 
somewhat arbitrary; for the first STARAN E built, 
1K of fast RAM and 8K of slower MOS memory were 
chosen. 


Each array of the first STARAN E model is 
organized as a matrix of 256 words by 9216 bits, 
as shown in Figure 4. This matrix is further 
broken down into 36 256-word x 256-bit square 
segments. As in the STARAN B model, within these 
segments, it is possible to read or write all 
bits of one word, or one bit of all words, or a 
few bits of many words, or many bits of a few 
words - all in one memory operation. The M, X, 
and Y processing element (PE) registers (Figure 4) 
are still 256 bits wide each and may be used as 
temporary storage for data moved to or from the 
array. 
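
The array dimensions quoted above follow from simple arithmetic, sketched here. The sketch reads the "1K fast RAM" and "8K MOS" figures as bits per array word, an interpretation that is consistent with the stated 9216-bit word but is an assumption; the function name `array_geometry` is hypothetical.

```python
WORDS_PER_ARRAY = 256   # words per array (from the text)
SEGMENT_BITS = 256      # each segment is 256 words x 256 bits (from the text)


def array_geometry(fast_bits_per_word: int, mos_bits_per_word: int):
    """Return (bits per word, number of 256 x 256 segments, bytes per array)."""
    bits_per_word = fast_bits_per_word + mos_bits_per_word
    segments = bits_per_word // SEGMENT_BITS
    total_bytes = WORDS_PER_ARRAY * bits_per_word // 8
    return bits_per_word, segments, total_bytes


if __name__ == "__main__":
    # First STARAN E built: 1K fast + 8K MOS bits per word (assumed reading).
    bits, segments, nbytes = array_geometry(1 * 1024, 8 * 1024)
    print(bits, "bits/word,", segments, "segments,", nbytes // 1024, "K bytes per array")
    # STARAN B array for comparison: a single 256-bit segment.
    print(array_geometry(256, 0))
```

Under this reading the first STARAN E array holds 36 segments and 288K bytes, against one segment and 8K bytes for a STARAN B array.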


Addressing the STARAN E model arrays is 
almost identical to the scheme used by the 
STARAN B model. Due to the extended length of 
each array word, a base register philosophy was 
incorporated in the STARAN E model and is 
depicted in Figure 5. A set of six 32-bit registers 
provides the capability of addressing the 
36 segments of 256 words by 256 bits of the first 
STARAN E model. A maximum capability of ad- 
dressing 256 of these segments (corresponding to 
the maximum array size) is provided. 
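
The relation between a bit position within the extended array word and the 256-bit segment that contains it can be sketched as follows. This shows only the arithmetic implied by the segment sizes in the text; the actual base register layout is that of Figure 5, which is not reproduced here, and the function name `locate_bit` is hypothetical.

```python
SEGMENT_BITS = 256   # bits per segment (from the text)
MAX_SEGMENTS = 256   # maximum addressable segments (from the text)


def locate_bit(bit_index: int):
    """Split a bit position in the array word into (segment, offset in segment)."""
    if not 0 <= bit_index < SEGMENT_BITS * MAX_SEGMENTS:
        raise ValueError("bit index outside the maximum 65,536-bit word")
    return divmod(bit_index, SEGMENT_BITS)


if __name__ == "__main__":
    print(locate_bit(9215))    # last bit of the first STARAN E built
    print(locate_bit(65535))   # last bit of a maximum-size array word
```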


Control Memory Size/Speed. The main function 
of the STARAN B control memory is to contain 
assembled application programs. It has also 
proven very necessary as a data buffer used by 
STARAN control, the MDA arrays, and the I/O 
channel of a host computer connection. The maximum 
address space of the STARAN B model is 65,536 
32-bit words. A maximum of 32,768 32-bit words 
are reserved for magnetic core memory; 1024 words 
for the high speed data buffer memory. These 
figures have not changed in the STARAN E model. 
The high speed page memory system has been 
enlarged considerably, though. Up to 8,192 32-bit 
words can be stored in each of the three page 
memories. Instruction execution time has been 
decreased almost 20%, to 100 nanoseconds. The 
remaining memory address space can be utilized 
for addressing the memory of a host computer if 
so connected. The first STARAN E model built has 
16,384 words of core memory, 4,096 words in each 
of 3 page memories, and 512 words of high speed 
data buffer memory. 


MDA Array Memory I/O. Interarray communication 
and I/O between the arrays and external 
devices can occur in two different ways for the 
STARAN B model. The most usual method is by way 
of the 32-bit wide common register, a path that 
can achieve a 12 to 15 megabit per second data 
transfer rate. The optional parallel I/O (PIO) 
path, with a channel width of 256 bits to each 
array in the system, provides the larger 
bandwidth in the STARAN B model of 640 megabits per 
second, but at the expense of a great deal of 
circuitry and cabling. 


One of the main features of the STARAN E 
model is direct access to data within the MDA 
arrays from an external device. The array access 
is made by stealing a machine cycle from STARAN 
control. In this way, STARAN does not have to 
expend processing time assisting in I/O, but 
instead may devote all its time to array pro- 
cessing. 


Three I/O ports are provided at each MDA 
array as shown in Figure 6. The first, a 256- 
bit PIO path, is capable of transfer rates from 
512 megabits per second up to 2560 megabits per 
second. The second is a 32-bit wide multiplexed 
I/O (MIO) path with the same basic bandwidth as 
the STARAN B model's PIO - from 80 to 640 mega- 
bits per second. The third is the standard com- 
mon register path to STARAN control that cur- 
rently exists in the STARAN B model. Its trans- 
fer rate varies from 12 to 15 megabits per 
second. Of these three ports, only the latter 
two are utilized in the first STARAN E model 
built. 


Implementation of the array I/O is made 
possible by four hardware units: (1) the array 
access resolver, (2) a multiplexed I/O con- 
troller, (3) the multi-port crossbar switch, and 
(4) a STARAN command channel (SCC). These are 
shown in Figure 6 for a two array STARAN E model 
configuration. The resolver block was shown 
separately in this figure for clarity. Normally, 
its function is assumed by the MIO block for 
diagrammatic purposes. 


• Access Resolver. The STARAN E array can 
be controlled by any one of three units - STARAN 
control, MIO control, or PIO control. Conflicts 
are controlled by the access resolver. A "snap- 
shot" scanning resolver is employed to decide 
which device is allowed access to the array. 
There are four types of requests that the re- 
solver looks for. In order of priority, they 
are: (1) STARAN control, (2) PIO, (3) MIO read, 
and (4) MIO write. A snapshot is taken only 
when a PIO, MIO read, or MIO write request is 
made and the requested array is available. 
STARAN control is the unit that in effect grants 
the resolver's request to service another unit. 
The array(s) will only be available if STARAN 
control is not currently utilizing that array. 
The highest priority request within the snapshot 
is then honored for one MDA array memory cycle. 
If other requests are pending, each of them gets 
one memory cycle on a priority basis. This pro- 
cedure continues until all devices eventually drop 
their requests. 
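
The snapshot behavior described above can be sketched as a small simulation. It rests on two assumptions that go beyond the text: STARAN control is taken to be idle on the array throughout (so the array is always available), and "each of them gets one memory cycle on a priority basis" is read as serving every request captured in a snapshot once, in priority order, before the next snapshot is taken. The names used are hypothetical.

```python
# Requesters that trigger a snapshot, highest priority first (from the text).
PRIORITY = ["PIO", "MIO_READ", "MIO_WRITE"]


def service_order(cycles_wanted):
    """Return the order in which memory cycles are granted.

    cycles_wanted maps a requester name to the number of array memory
    cycles it needs before it drops its request.
    """
    for name in cycles_wanted:
        if name not in PRIORITY:
            raise ValueError(f"unknown requester: {name}")
    remaining = {n: c for n, c in cycles_wanted.items() if c > 0}
    order = []
    while remaining:
        # Take a snapshot of all pending requests, in priority order.
        snapshot = [n for n in PRIORITY if n in remaining]
        for name in snapshot:
            order.append(name)       # one MDA array memory cycle each
            remaining[name] -= 1
            if remaining[name] == 0:
                del remaining[name]  # requester drops its request
    return order


if __name__ == "__main__":
    print(service_order({"MIO_WRITE": 2, "PIO": 1, "MIO_READ": 2}))
```

In this reading a long MIO transfer cannot starve a lower-priority request, since every pending requester receives one cycle per snapshot.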


STARAN's exclusive utilization of the arrays 
is not interrupted unless an array tests busy. 
At that point, basic array access efficiency 
remains high but STARAN control efficiency drops 
slightly. 


• MIO. As shown in Figure 6, each array in 
a STARAN E model system employs a MIO that allows 
array I/O to occur on a cycle stealing basis. 
The array I/O can be initiated by the execution 
of an I/O instruction of an external I/O processor 
(IOP) or by the execution of an associative 
instruction of the STARAN E proper during an 
interarray transfer. These data transfers take 
place over the 32-bit wide array I/O busses as 
coupled together by the crossbar circuitry. The 
same bandwidth as the STARAN B's PIO is maintained 
by "burst" transmitting the data with a clock rate 
of 50 nanoseconds per 32-bit word. Another 32-bit 
port connected to the MIO is the common register 
data path from the STARAN E model mainframe. The 
same data transfer rate as the STARAN B's common 
register data path is maintained (15-20 megabits 
per second). 


The smallest data item passed through the 
MIO from the crossbar unit is 256 bits. A clocked 
multiplex/demultiplex scheme is employed in the 
MIO that breaks a 256-bit item from the array into 
eight 32-bit words during transmission to the 
crossbar. Likewise, when eight 32-bit items are 
received by the MIO from the crossbar, they are 
packed into a 256-bit item for parallel transfer 
to the MDA array. The MIO, as shown in Figure 7, 
is designed such that data transfers may occur 
in both directions simultaneously, but not 
necessarily at the same rate. Data input to the MIO 
from the crossbar touches both the demultiplex 
buffer and the control register input. MIO data 
output to the crossbar comes from the multiplexed 
output buffer or the control registers. Control 
registers may be transferred between each other 
over an internal path. The MIO circuitry is 
capable of several I/O functions that are useful to 
internal STARAN E model data manipulation as well 
as external I/O. In most cases, the MIO is 
activated by loading the internal control registers 
shown in Figure 7. The functions that the MIO 
is designed to perform include: (1) continuous 
transmit, (2) block transmit, (3) receive, and 
(4) exchange. These are pictorially represented 
in Figure 8. 


Continuous Transmit. An IOP may initiate 
this mode when it is required to read MDA array 
data. It establishes a hardware connection to 
the specific array MIO through the crossbar switch, 
writes the "continuous transmit" register (CR2) and 
then breaks the connection. The MIO is now 
activated and will establish the connection to the 
IOP and begin transmitting array data. The 
transmission continues until: (1) IOP control 
terminates the transfer, or (2) CR2 is loaded again by 
another IOP. The continuous transmit mode allows 
several IOP's to obtain data from the same array 
during a given time interval. 


The continuous transmit mode requires that the 
intelligence controlling the transfer be on the 
receiving (IOP) side of the transfer. Data 
transmission from the MIO can cease at any time and it 
would be up to the IOP hardware to restart it. 


Block Transmit. An IOP may also read 
array data a block at a time. A connection is 
made as before and registers CR2, CR3, and CR4 are 
written into the source array MIO with source and 
destination addresses, block size, and other 
pertinent data. The connection is broken as before 
and the MIO is activated. The contents of 
register CR3 are placed on the output and loaded into 
the IOP. The corresponding array words follow. 
Once a block transfer is initiated, only an 
"exchange" operation can intervene. Each time an 
exchange does occur, the MIO will reinitiate the 
connection and continue the transmission where 
it left off. A "receive" can also occur at the 
MIO and allow data to be written in the array on 
a cycle stealing basis using the "block transmit" 
capability. 


The block transmit function can also be 
initiated by STARAN control writing CR2, CR3, and 
CR4 in the source array MIO and allows data to be 
sent to one or more arrays from the source array. 
Register CR3 in the source array is loaded into 
CR1 of the destination MIO(s) and the array words 
follow. The source array MIO in a block transmit 
mode will not accept another continuous transmit 
or block transmit request until the current one 
is complete. However, the source array MIO will 
allow "exchange" and "receive" operations to occur 
as explained above. 


Receive. An IOP may write data to an 
array by first connecting to the MIO. Next, CR1, 
the "receive control register," is loaded. Data 
is then transmitted into the MIO. The receive 
operation will allow any other operation to 
intervene on a cycle stealing basis. However, the IOP 
or source array must have the capability to 
reinitiate the "receive" operation. 


Exchange. Another type of interarray 
communication that can be performed by the 
STARAN E model is the "exchange" function. This 
function uses the data path afforded by the MIO 
and the crossbar switch to perform a synchronous 
data transfer. Array address and control 
information are provided by an associative instruction 
executed within STARAN control that triggers the 
exchange function. Array connectivity is 
controlled by the connection registers within the 
crossbar switch. All data manipulations performed 
by the associative exchange instruction are 
identical to their non-exchange counterparts. Its 
basic purpose, however, is to provide interarray 
data communication. The contents of one memory 
location may be swapped synchronously with the 
contents of another's memory location. Or, data 
from one array may replace data in all the other 
arrays in the system if so desired - all within 
the same period of time that it takes to read the 
data from the source array. A variety of other 
exchange options are available. 


• Crossbar Data Switch. The crossbar data 
switch of the STARAN E model provides part of 
the path that allows MDA array data to be 
accessed by external devices (and other arrays) 
without direct STARAN participation in the 
transfer. The crossbar is an eight port data switch 
that has one of its ports connected to each array 
in the system. Each port contains 32-bit wide 
data paths and operates at 80 to 640 megabits 
per second. Remaining ports may be connected to 
IOP's for direct MDA array transfers. A port is 
defined as having simultaneous input and output 
capabilities. 
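
The MIO's splitting of a 256-bit array item into eight 32-bit words, its repacking on receipt, and the burst rate implied by a 50 nanosecond clock can be sketched as follows. The word ordering (least significant word first) is an assumption, as are all function names; the text does not specify the order in which the eight words are sent.

```python
WORD_BITS = 32                          # width of the array I/O bus (from the text)
ITEM_BITS = 256                         # smallest item passed through the MIO
WORDS_PER_ITEM = ITEM_BITS // WORD_BITS
WORD_MASK = (1 << WORD_BITS) - 1


def split_item(item: int):
    """Break a 256-bit item into eight 32-bit words (low word first, assumed)."""
    if not 0 <= item < (1 << ITEM_BITS):
        raise ValueError("item must fit in 256 bits")
    return [(item >> (WORD_BITS * i)) & WORD_MASK for i in range(WORDS_PER_ITEM)]


def pack_words(words):
    """Pack eight 32-bit words back into one 256-bit item."""
    if len(words) != WORDS_PER_ITEM:
        raise ValueError("exactly eight words are required")
    item = 0
    for i, w in enumerate(words):
        if not 0 <= w <= WORD_MASK:
            raise ValueError("each word must fit in 32 bits")
        item |= w << (WORD_BITS * i)
    return item


def burst_rate_mbit_per_s(word_bits: int = WORD_BITS, clock_ns: float = 50.0) -> float:
    """Peak rate in megabits per second for one word per clock period."""
    return word_bits * 1000.0 / clock_ns


if __name__ == "__main__":
    item = (1 << 255) | 0xDEADBEEF
    words = split_item(item)
    print([hex(w) for w in words])
    print(pack_words(words) == item)
    print(burst_rate_mbit_per_s(), "megabits per second")
```

One 32-bit word every 50 nanoseconds gives 640 megabits per second, which agrees with the upper figure quoted for the STARAN B model's PIO.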


Large systems may have up to eight arrays 
and have several IOP's attached. In order to 
accommodate situations like this, the crossbar 
design allows a second crossbar to be added to 
the first, thus providing 14 ports. Up to four 
crossbars may be ganged together like this to 
provide 20 free data ports. 


All STARAN E systems may also use the cross- 
bar data switch to transfer information from one 
array to another. The transfer may be synchron- 
ous under control of a STARAN associative 
instruction, or asynchronous under MIO control. 


The STARAN E model instruction set was 
expanded to incorporate several new instructions 
that control the MIO, crossbar, and SCC. Those 
required for the crossbar proper allow four basic 
functions to be performed: (1) reset, (2) read 
registers, (3) write registers, and (4) strobe 
status. These instructions are issued over the 
SCC to the crossbar and MIO devices. The control 
registers in the crossbar as well as the MIO are 
accessible by any device attached to the SCC. 


• STARAN Command Channel (SCC). The SCC 
is the vehicle by which command and control 
information is passed from STARAN control to the 
IOP's and MDA array MIO's that are attached to 
the crossbar data switch. The SCC is driven by 
I/O instructions executed by STARAN. In addition 
to command transfer, the SCC also provides a 
data output and input path 36 bits wide, 
including 4 bits of parity. IOP interrupts are 
also supported by the channel. The SCC, crossbar 
data switch, and MIO are all involved during 
interarray operations and are initiated by 
STARAN I/O instruction execution. The SCC 
originates at STARAN control and can be 
daisy-chained from its first connection at the 
crossbar control logic to any IOP's within the system. 


STARAN E vs. STARAN B in LACIE 


STARAN Program Storage 


It was indicated earlier that the LACIE 
processing at NASA is handled by IBM 360/75 host 
computer(s) that off-load five computationally 
demanding LACIE pattern recognition tasks to a 
STARAN B computer. The particular 2-array 
STARAN B used is equipped with a 136K byte (34K 
32-bit word) control memory. The STARAN B is 
connected to the selected host computer via a 
custom-built channel-to-channel interface unit 
that connects to STARAN's buffered I/O (BIO) port. 
Data that transfer through this port are read 
from or are written to the STARAN B model's 
program memory (i.e., program memory is used for 
data as well as instruction storage). 


The maximum intermachine data transfer rate 
is restricted to less than one Megabyte per 
second as a result of the need for very long 
cable lengths. (Since fundamental data transfer 
rates that can be supplied by the host during 
the execution of a STARAN LACIE task lie 
substantially below this rate, the one Megabyte per 
second intermachine transfer rate limit imposed 
by the channel causes no apparent impact on LACIE 
processing.) To support block data transfers 
between machines, receive and send I/O buffers 
are provided by both machines. The size of these 
buffers is related to the maximum data block size 
allowed in a single one-way data transfer. To 
minimize the number of I/O interrupts the host 
machine is required to process, it is desirable 
to use large data block transfers. Furthermore, 
to support simultaneous STARAN processing, host 
processing, and intermachine data transfers, 
more than one I/O buffer is required in each 
machine. As a result, a considerable portion of 
STARAN B program memory is allocated for use as 
I/O buffers. At present, data block sizes for 
one-way intermachine transfers are restricted to 
20K bytes or less. For reasons imposed by 
intermachine I/O protocol, on the order of 60K bytes 
of buffer space must be made available in STARAN 
to achieve a double buffering capability. 


Thus, a large percentage of STARAN control 
memory is allocated for use by transient data. 
Yet other data, namely input parameter data, 
require large allocations of program memory 
storage space for classification tasks when pixel 
measurement vectors have many components. In 
particular, when 20-component measurement vectors 
are associated with the pixels to be classified, 
when the reference statistics for 60 crops are 
to be used in classifying pixels, and when the 
Mixture Density Classification task is used to 
accomplish classification, the input parameter 
data (crop reference statistics) require on the 
order of 56K bytes of program memory. Another 
4K bytes of program memory is required for 
intermediate data storage when the above task is 
executing. As a result, only on the order of 16K 
bytes of program memory is available for the LACIE 
executive, STARAN I/O handlers and LACIE task 
application modules. The actual system software 
requires about 8K bytes of the remaining 16K 
bytes, leaving 8K bytes of program memory for all 
five LACIE applications tasks. Since the amount 
of code for the LACIE tasks exceeds 40K bytes of 
instructions, insufficient program memory exists 
in the JSC STARAN to allow the various STARAN 
LACIE task modules to co-reside within the 
STARAN B program memory. In fact, the complete 
instruction set for only one LACIE application 
task exists in the program memory at any one 
time. The remaining programs are stored on a 
disc associated with the STARAN system. Thus, 
when the host calls upon STARAN to execute a 
task, STARAN must first determine whether or not 
the desired task program already exists in 
program memory. If the program is not resident in 
the memory, it must be called in from disc 
storage. Such an operation results in substantial 
delays and becomes particularly noticeable when 
task calls treat relatively few pixel measurement 
vectors. 
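
The program memory budget described above can be tallied directly. The sketch below uses only the round figures quoted in the text ("on the order of" values, so the result is approximate); the function and key names are hypothetical.

```python
def program_memory_budget(
    total_kbytes: int = 136,        # 34K 32-bit words of control memory
    io_buffers_kbytes: int = 60,    # double-buffered intermachine I/O
    parameters_kbytes: int = 56,    # crop reference statistics (worst case)
    intermediate_kbytes: int = 4,   # intermediate data for Mixture Density
    system_kbytes: int = 8,         # LACIE executive and I/O handlers
    task_code_kbytes: int = 40,     # lower bound on code for all five tasks
):
    """Tally the STARAN B program memory left over for LACIE task code."""
    left_after_data = (
        total_kbytes - io_buffers_kbytes - parameters_kbytes - intermediate_kbytes
    )
    left_for_tasks = left_after_data - system_kbytes
    return {
        "left_after_data_kbytes": left_after_data,
        "left_for_tasks_kbytes": left_for_tasks,
        "tasks_fit": task_code_kbytes <= left_for_tasks,
    }


if __name__ == "__main__":
    print(program_memory_budget())
```

With these figures 16K bytes remain after data, 8K bytes remain for task code, and the five tasks (more than 40K bytes of instructions) cannot be resident together.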


In order to alleviate the delay problem, the 
applications programs were separated into 
initialization and processing segments; the 
initialization segments for all five tasks were 
combined and are kept in program memory. Thus, if 
a task needs to be loaded from disc, STARAN may 
proceed with the execution of initialization 
actions as the processing segment of the task 
program is being loaded. 


The need for such "cleverness" in managing 
the LACIE application programs disappears when 
using the STARAN E machine in place of the B 
machine. The program memory of the E machine is 
supplemented by 288K bytes of array memory. If 
an E machine were used for the LACIE program, it 
would be tied to the host via a STARAN E crossbar 
port. Pixel measurement data would pass directly 
into STARAN array storage rather than into pro- 
gram storage. Intermediate task results would 
remain in array memory. And the bulk of the pro- 
gram memory would be available for program stor- 
age. As a result, all five STARAN LACIE appli- 
cations programs would co-reside in the three 16K 
byte page memories. Delays in task execution 
disappear with the elimination of the requirement 
to load programs from disc. Even short tasks 
would be executed efficiently since all tasks 
would execute out of high speed page memory. The 
systems software and the applications program 
design would be reduced substantially. All soft- 
ware now would fit comfortably within the 48K 
byte page memory system, and program segmenting 
would be eliminated. The availability of large 
I/O buffers would permit simpler I/O handlers. 


STARAN Array I/O 


When the JSC STARAN B I/O buffer contains 
pixel data and the task program requires it to 
be moved to array storage, the data must be moved 
to array storage by way of the common register. 
When the transfer is made most efficiently, the 
full bandwidth of this path is only on the order 
of 2 Megabytes per second (as compared to the 
512 Megabyte per second bandwidth for the 
processing-elements-from-arrays path). During such 
transfers, all STARAN processing stops. 


Because data transfers between the STARAN 
program and array memories occur at a slow rate, 
and because such transfers degrade the processing 
power of the machine, much effort was spent to 
attain a software design for each task that would 
eliminate the use of the program memory for 
storing intermediate task data. Only the soft- 
ware design for the Maximum Likelihood Classifi- 
cation task was able to eliminate such transfers; 
all other tasks required the movement of at least 
some intermediate data through the common path. 


The impact of moving intermediate data was 
most severe on the timing of tasks that are used 
to establish crop reference statistics; namely, 
the Statistics, Iterative Clustering, and 
Adaptive Clustering tasks. The ratio of the amount 
of time spent for making common path data 
transfers compared to the time required for arithmetic 
processing is dependent on both the task and task 
setup parameters. Ratios of 5:1 for Iterative 
Clustering and 1:5 for classification are repre- 
sentative of those likely to be observed while 
performing commonly encountered LACIE jobs. 


The STARAN E would eliminate the requirement 
to move input, intermediate, or output data 
through the common path since such data would 
remain in the array memory throughout a task. 
The execution speed of all tasks would increase 
significantly; for the clustering tasks, dramatic 
execution time improvements in excess of 2:1 
could be expected. The construction of code to 
implement intermediate data storage would be sim- 
plified since the temporary storage region would 
be in the same array memory that produced the 
intermediate data. 
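
The transfer-to-arithmetic ratios quoted above give a rough upper bound on the gain from removing common path transfers. The sketch assumes that arithmetic time is unchanged and that keeping data in array memory costs nothing, neither of which is claimed by the text, so the results are idealized bounds rather than predicted timings. The function name is hypothetical.

```python
from fractions import Fraction


def speedup_if_transfers_removed(transfer_parts: int, arithmetic_parts: int) -> Fraction:
    """Ideal speedup when common path transfer time is eliminated entirely.

    transfer_parts : arithmetic_parts is the ratio of transfer time to
    arithmetic time, e.g. 5 : 1 for Iterative Clustering.
    """
    if transfer_parts < 0 or arithmetic_parts <= 0:
        raise ValueError("ratio parts must be non-negative, arithmetic part positive")
    return Fraction(transfer_parts + arithmetic_parts, arithmetic_parts)


if __name__ == "__main__":
    print("Iterative Clustering (5:1):", speedup_if_transfers_removed(5, 1))
    print("Classification (1:5):", speedup_if_transfers_removed(1, 5))
```

A 5:1 ratio bounds the gain at 6:1, which is compatible with the "in excess of 2:1" expected for the clustering tasks; a 1:5 ratio bounds it at 1.2:1.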


STARAN Algorithm Development 


The STARAN B model array word is 256 bits 
long; the two array JSC STARAN B model provides 
512 such words. Regardless of the LACIE task 
called, 512 pixel measurement vectors are treated 
at a time; one vector is assigned to each array 
word. To satisfy LACIE requirements, vectors 
with up to 20 8-bit components have to be ac- 
cepted as input. To provide worst-case capa- 
bility, 160 bits of a STARAN array word are re- 
quired for storage of the input pixel data. The 
remaining 96 bits are available for storing 
intermediate pixel related data and for per- 
forming required arithmetic/logic operations. 
The 96-bit wide space is inadequate for storing 
intermediate results for all but one task (Maxi- 
mum Likelihood Classification) and, as was de- 
scribed earlier, forces the use of program mem- 
ory for storage. 
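
The field space available for intermediate results follows from the word length and the worst-case input vector. A minimal sketch, using the figures in the text; the function name `free_bits` is hypothetical.

```python
def free_bits(word_bits: int, components: int, bits_per_component: int) -> int:
    """Bits of an array word left over after storing one measurement vector."""
    used = components * bits_per_component
    if used > word_bits:
        raise ValueError("measurement vector does not fit in the array word")
    return word_bits - used


if __name__ == "__main__":
    # STARAN B: 256-bit word, worst case of 20 components of 8 bits each.
    print("STARAN B:", free_bits(256, 20, 8), "bits free")
    # First STARAN E built: 9216-bit word, same worst-case vector.
    print("STARAN E:", free_bits(9216, 20, 8), "bits free")
```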


Because of the undesirability of using pro- 
gram memory for the storage of intermediate 
results, the software design effort sought to 
(1) minimize the number and types of intermediate 
data items and (2) reduce the precision of all 
such items to the bare minimum required to ful- 
fill LACIE accuracy requirements. The design 
guidelines resulted in the intensive examination 
of the basic arithmetic descriptions of the 
tasks. This allowed arithmetically equivalent 
forms to evolve that could be computed fast, 
operate in minimal field space, and generate 
intermediate data with well defined statistics. 
As examples, two major subroutines that allow 
the Maximum Likelihood Classification task to 
hold intermediate data in array storage are the 
common multiply and accumulate routine and the 
rounded square routine. The former routine con-
serves array field space by eliminating the pro-
cess of first generating a product field and
then adding it to the accumulation of previous
products. Instead, the product field is added
directly into the accumulation field as the pro-
duct is formed. The latter rounded squares
routine squares the contents of a particular in-
put field and simultaneously rounds the squared
field to the length of the final output field.


Clearly, the array space of the STARAN B
model forced the software designer in this case
to conserve as much field space for intermediate
storage as possible. A STARAN E model applied to
the LACIE application has no requirements for
special field conserving subroutines. Further-
more, the new variable floating point arithmetic
routines that are provided by STARAN E's software
language are at least as efficient as those
developed specifically for the LACIE tasks.
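The two field-conserving routines can be pictured as follows. This is a minimal sketch in Python, not STARAN code: the function names, the use of Python integers for array fields, and the round-to-nearest rule are assumptions made for illustration. It models a single array word, whereas STARAN applies the same bit-serial steps to all words in parallel, and it rounds after squaring for clarity rather than during it.

```python
def multiply_accumulate(acc, multiplicand, multiplier, multiplier_bits):
    """Add multiplicand * multiplier into acc without a separate product field.

    Each set multiplier bit adds a shifted copy of the multiplicand straight
    into the accumulation field, so no intermediate product is ever stored.
    """
    for bit in range(multiplier_bits):
        if (multiplier >> bit) & 1:
            acc += multiplicand << bit
    return acc


def rounded_square(value, in_bits, out_bits):
    """Square an in_bits-wide value and round the result to out_bits."""
    full = value * value               # exact square, up to 2 * in_bits wide
    drop = 2 * in_bits - out_bits      # low-order bits that do not fit
    if drop <= 0:
        return full
    return (full + (1 << (drop - 1))) >> drop   # round to nearest, then truncate


print(multiply_accumulate(10, 3, 5, 8))
print(rounded_square(15, 4, 4))
```

The point of both routines is the same: the only fields that ever exist are the inputs and the final output, so no scratch space is spent on a full-width intermediate.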


Summary/Conclusions 


The commercially available STARAN E model 
hardware/software enhancements include: 


• faster MDA arrays that provide a minimum of
36 times the storage capacity of present
STARAN B model arrays,


• new crossbar hardware that allows interarray
data transfers from 8 to 64 times faster (at
composite rates up to 640 megabytes per
second per crossbar) and will allow host-to-
STARAN array moves (at rates up to 80 mega-
bytes per second per array),


• new page memories that provide 8 times the
storage and allow up to 65% faster array
instruction execution times, and


• a set of floating point arithmetic modules
that allow the STARAN programmer to arbi-
trarily specify the mantissa and exponent
lengths.


The impact of using a STARAN E model rather 
than a STARAN B model, for the LACIE program, is 
summarized below. 


• All five LACIE applications software
modules could be pre-loaded into the page section 
of STARAN E's control memory. No calls to the 
STARAN disc would be required after a task was 
requested by the host computer. As a result, 
even tasks involving few data items would be able 
to be performed efficiently in the STARAN E. 


• No intermediate data generated during
the course of task execution would need to be 
stored in STARAN program memory. Intermediate 
data would be stored within the confines of the 
arrays. The time required to store or access 
such data would be reduced by a factor of at 
least 200:1. By eliminating the requirement to 
store data in program memory, all tasks would be 
able to execute in less time even if the fast 
array memory bandwidths had not been improved. 


• Intermachine I/O would move data directly
between STARAN arrays and the host. By using 
this path for moving data, no STARAN program 
memory data exchanges are required. As a result, 
the degrading effect on processing power, that 
data transfers along this path cause, is elimin- 
ated. 


• Software design and layout costs would be
reduced dramatically as a direct result of having 
ample control and array memory resources. A one- 
array 9K bits/word STARAN E model provides about 
3 times the total storage of the two-array JSC 
STARAN B. The effort that was necessary to fit 
both programs and data into a limited storage 
facility would not be required. The new standard 
STARAN E model arithmetic modules provide exe- 
cution times that rival those of the specialized 
modules developed in LACIE. Thus, software debug 
time would be limited largely to main routines. 


Based on the results of the study, a single 
array STARAN E model machine can perform the 
overall LACIE tasks as well as a two-array 
STARAN B model. The one-array E model would 
execute clustering tasks faster than the two- 
array B model machine, but would execute classi- 
fication tasks somewhat slower than on the B 
model machine. It would outperform the B model 
when executing tasks involving only a small number 
of pixels. Both the STARAN application and 
systems software for the E machine would have 
been simpler to design, required less code, less 
time to code, less time to debug, less time to 
document, and less time to maintain. The overall 
software costs would likely have been halved. 


Acknowledgement 
The authors wish to express their gratitude 
for the technical support provided by John P. 
Rasmussen, one of the major design engineers 
responsible for the new array I/O architecture. 


References 


[1] R. Faiss, J. Lyon, M. Quinn, S. Ruben,
"Application of a Parallel Processing Com-
puter in LACIE," 1976 International Con-
ference on Parallel Processing, pp. 24-32.

[2] Goodyear Aerospace Corporation, "The
STARAN E System - An Overview," GAC Docu-
ment Number AP-123226, 29 September 1976.

[3] K. E. Batcher, "STARAN Series E," 1977
International Conference on Parallel Pro-
cessing.


[Figure 1. STARAN E Block Diagram]

[Figure 2. MDA Array Memory Module]

[Figures 3 through 5, with the captions: Typical MDA Array; A Maximum STARAN E Configuration (total storage capacity 134.2 megabits, 16.8 megabytes, or 4.2 megawords at 32 bits/word); First STARAN E System; Addressing Example - First STARAN E System]

[Figure 6. STARAN E Array I/O]

[Figure 7. MIO Unit - STARAN E System. Control registers: CR0 - transmit control; CR1 - receive control; CR2 - continuous transmit; CR3 - pending block transmit; CR5 - status. Field legend: AM - access mode; BL - block length; DA - destination address; DC - destination XBAR no.; DP - destination port no.; HM - home mask; MA - mask address; MF - mask function; SA - source address]

[Figure 8. MIO Data Transfer Functions: continuous transmit, block transmit, receive, exchange]
A MODIFIED ALAP CELL FOR PARALLEL TEXT SEARCHING 


Hubert H. Love, Jr. 
Hughes Aircraft Company 
Culver City, California 90230 


Summary 


The Associative Linear Array Processor 
(ALAP) and several of its applications are de- 
scribed in (1) and (2). An ALAP memory con- 
sists of a set of cells organized around four bit- 
serial data channels. Three of these are common 
buses connected to external registers. One of the 
common channels permits both arithmetic and 
match operations to be performed between the 
data in selected cells and an external operand. 
The selection of the operation is global. The other 
two common channels are for input and output. 
All three common channels can operate 
simultaneously. 


The fourth channel, the "chaining channel", 
permits each cell, under a combination of local 
and global control, to transfer data and control 
states to its nearest neighbor in one direction 
only. Each cell, under local program control, 
can either accept the data from its chaining chan- 
nel input, or else relay the data to its neighbor- 
ing cell. 


The major components of the ALAP cell are: 


1. A 64-bit shift register, the "data regis-
ter", which holds the cell's data.

2. An arithmetic unit which performs arith- 
metic and match operations between the contents 
of the data register and the data from the 
chaining channel or one of the common channels. 
The results can either be retained in the cell or 
put on the chaining channel to the next cell. 

3. A six-bit "flag register". The states of
the bits determine the operation of the cell's 
routing logic, and also determine whether or not 
the cell is to participate in input, output, data 
transfer or arithmetic operations during command 
execution. The flag bit states can be reordered 
and logically combined, and can also be trans- 
ferred from cell to cell via the chaining channel.


There are three basic ALAP commands. One 
of these is used in conjunction with fault-detection 
software, and can effectively remove a malfunc- 
tioning cell from the array. The second command 
is the "flag shift" command which alters and
reorders the flag register contents. The third
command is the "word cycle" command. This
command causes the data register contents of a
selected subset of cells to be shifted simultane- 
ously while data transfer, match or arithmetic 
operations take place as specified by the flag 
registers and the states of global control and 
data lines. 


The chaining channel and the cell's arithmetic
and control logic give the ALAP memory the
capability for performing arithmetic, match and 
data-transfer operations among many sets of 
cells simultaneously. 
The basic text-processing task discussed here
is that of locating all occurrences of a specified
set of "key" words appearing, in any order, in a
large data base of raw text, such that the words 
lie within a specified range of character positions. 
The sentences which encompass each occurrence 
of the set are output to the user together with the 
associated document identifiers. 


For this application, several modifications to 
the basic ALAP cell are required. The structure 
of the modified cell is shown in Figure 1.


[Figure 1. Modified ALAP Cell (chain in from previous cell, data register, chain out to next cell)]


The principal modifications are the following: 

1. Instead of 64 bits, the modified cell 
has a data register of 64,000 bits for storing the 
text. It is hoped that the use of a very long data 
register will reduce the cost-per-bit of the modi-
fied ALAP memory to the order of that for a medium
priced disk. The data registers are fabricated on
the same multi-cell wafer with the rest of the cell
logic.

2. The modified cell contains an additional 
register, the "auxiliary register", together with 
some associated data routing and arithmetic logic. 
This 24-bit register can recirculate through a 
small arithmetic unit which can subtract or add 
a globally-specified operand to the register con- 
tents. In addition, the auxiliary register inter- 
faces with the chaining channel in the same fashion 
as the data register. 

3. Unlike the original ALAP cell, the arithmetic
logic for the modified cell includes no step-
multiply, step-divide or step-square-root 
capability. 


For the text-processing task, the entire text 
resides in a succession of eleven-bit "character
fields" in the data registers, each field containing
one character plus three flag bits. Sentences
and even character fields may lie across cell 
boundaries. Since the chaining logic permits all 
cells to shift their contents simultaneously on the 


chaining channel, the entire ALAP memory can 
act as one large shift register, and the cell 
boundaries can usually be ignored during 
processing. 


The search procedure is conducted in three 
phases. The first of these is the operation of 
locating all occurrences of each key word in the 
specified set, and of tagging them in the flag bits 
of their most-significant characters. The second
phase is the process of locating and tagging all 
occurrences of the set of key words which lie 
within the specified range. The third phase con- 
sists of outputting to the user the sentence or 
sequence of sentences which contain each match- 
ing set of words, together with the document 
identifiers. All three phases of the process are 
performed in parallel for all cells, and are inde- 
pendent in execution time of the size of the data 
base. However, the execution time for the third 
phase is dependent on the number of matching 
sets, since the matches are output sequentially. 


In Phase 1, a number of parallel searches 
equal to the number of characters in the word are 
performed so that all occurrences of the word 
may be found, regardless of their orientation. 
Figure 2 illustrates this process. The contents 
of three (abbreviated) cells are shown with the 
chaining channel interconnections indicated by 
the arrows. The key word being sought is
"TRUTHS". Below are shown four of the six 
comparands, corresponding to four of the six 
orientations of the key word, as they are fed into 
the ALAP memory from an external register. 
The third of these is seen to match successfully. 
Phase 1 requires n complete word cycle opera- 
tions, where n is the total number of characters 
in all of the key words. 
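One way to read "orientation" is as the alignment of the key word relative to a comparand stream that repeats the key word end to end. Under that reading, which is an assumption and not something the paper states, a comparand started at offset r can only match occurrences whose starting position is congruent to r modulo the key-word length, so one pass per character covers every possible position. The sketch below is hypothetical: `phase1_tag` and the flat character string stand in for the bit-serial comparison that the ALAP cells would perform across chained data registers.

```python
def phase1_tag(text, key):
    """Return the start positions of key in text, one 'word cycle' per orientation."""
    k = len(key)
    tagged = set()
    for orientation in range(k):        # one complete pass per key character
        # In this pass the repeating comparand lines up with the text only at
        # positions orientation, orientation + k, orientation + 2k, ...
        for start in range(orientation, len(text) - k + 1, k):
            if text[start:start + k] == key:
                tagged.add(start)
    return sorted(tagged)


text = "WE HOLD THESE TRUTHS TO BE SELF-EVIDENT THAT ALL"
print(phase1_tag(text, "TRUTHS"))
```

In hardware the inner loop would be carried out by all cells at once, so the cost is the number of passes, that is the number of characters in the key word, regardless of how much text is stored.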


Phase 2 of the operation requires a separate 
search for each possible permutation of the set 
of key words. In searching for each permutation, 
one (12-bit) field of the auxiliary register is used 
to count the characters between the first key word 
and the last, according to the range specification. 
A second (11-bit) field contains each single char- 
acter from the text in turn, shifted in from the 
data register. This permits each text character
to be compared against several characters with-
out requiring complete word cycles for each
comparison. Several comparisons are required
because the various cells may be in different
states with regard to the search (for example,
different key words may be the object of the cur-
rent search, depending on the cell). The state
of the operation within the cell is denoted by the
setting of a 3-bit field of the auxiliary register.


[Figure 2. Phase 1 Search Operation]


For each permutation of a set of k key words, 
Phase 2 requires 2k complete data register 
shifts. Each shift is about k-1 times slower than 
those in Phase 1 because of the aforementioned 
repeated comparisons. The total Phase 2 execu- 
tion time is therefore equivalent to 2k(k-1)k! word 
cycle operations. (In practice, k will rarely be 
greater than 4.) 
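The Phase 2 cost expression can be tabulated directly. The sketch below simply evaluates 2k(k-1)k! for the values of k mentioned in the text; the function name is illustrative.

```python
from math import factorial


def phase2_word_cycles(k):
    """Phase 2 cost, in equivalent word cycles, for a set of k key words."""
    permutations = factorial(k)       # one search per ordering of the key words
    shifts_per_permutation = 2 * k    # complete data register shifts
    slowdown = k - 1                  # each shift is about k-1 times slower
    return shifts_per_permutation * slowdown * permutations


for k in (2, 3, 4):
    print(k, phase2_word_cycles(k))
```

The factorial term dominates quickly, which is why the remark that k rarely exceeds 4 matters for practical search times.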


Phase 3 requires two data register shifts
(i.e., word cycle operations) for each matching
key word set. At a clock rate of 5 MHz, a word
cycle requires 13.1 msec. At this clock rate,
and if key words are six characters long, total
search times for sets of two, three and four key
words are 0.6 sec, 1.7 sec, and 9.7 sec respectively,
including a 20 percent overhead for flag shifting
operations.
PERFORMING SUMMATION AND PRODUCT IN AN
ASSOCIATIVE PROCESSOR 


I-Ngo Chen 
Department of Computing Science 
The University of Alberta 
Edmonton, Alberta, Canada 


Summary 

It is well known that successive operation on a
set of n numbers (e.g. the summation or the prod-
uct of n numbers) requires $\log_2 n$ steps for
parallel processing no matter how many processors
are assumed to be available. For an associative
processor employing bit-sequential-word-parallel
operation [1], the number of steps required for
the summation of n numbers each of length m is

$$\left(\tfrac{1}{2}\log_2 n + m\right)\cdot\log_2 n \qquad (1)$$


For the product, the number of steps required 
will be 


if the general shift-and-add multiplication is
employed. 


In this short paper, we present two procedures 
which would reduce the bounds of Eqs. (1) and 
(2), if the associative memory is sufficiently 
large. We shall thus assume: 

1. that the associative memory has at least 
n words and that the word length is 
sufficiently long; 

2. that there is a data-manipulator [2] 
which can perform some simple data mani- 
pulating functions [3] like 


shift 

En(X) - take all even elements of the 
vector X 

Od(X) - take all odd elements of the 
vector X; 


3. that for simplicity, all numbers are 
positive integers; 

4. that there is a bit-slice full adder (be
it hardware or routine like in STARAN
[4]) which takes a bit-slice of augend
$A_i$, a bit-slice of addend $B_i$, and a bit-
slice of previous carry $C_{i-1}$ and gives as
outputs a bit-slice of carry $C_i$ and a
bit-slice of sum bit $S_i$, i.e. we shall
have $F_1$ and $F_2$ such that

$$C_i = F_1(A_i, B_i, C_{i-1}), \qquad S_i = F_2(A_i, B_i, C_{i-1}) \qquad (3)$$

To perform summation of n numbers each of
length m, first we divide the numbers into 2
equal parts $A^1$ and $B^1$ as shown in Fig. 1. Next
we read a bit-slice of $A^1_i$ and a bit-slice of
$B^1_i$ to the full adder and write the odd and the
even elements of the sum $S^1_i$ respectively into
$A^2_i$ and $B^2_i$, i.e.

$$A^2_i = \mathrm{Od}(S^1_i), \qquad B^2_i = \mathrm{En}(S^1_i).$$

In general

$$A^{j+1}_i = \mathrm{Od}(S^j_i), \qquad B^{j+1}_i = \mathrm{En}(S^j_i)$$

for j = 1,2,...,k, where k is the least
integer greater than or equal to $\log_2 n$.
As depicted in Fig. 1, the number of steps
required is

$$m + 2\log_2 n \qquad (4)$$

while the number of reads and writes is

$$2 \times 2\left(m + 2\log_2 \tfrac{n}{2}\right),$$

which compares favorably to

$$\log_2 n\,(2m + \log_2 n - 1) \ \text{reads}$$

and

$$\log_2 n\left(m + \tfrac{\log_2 n + 1}{2}\right) \ \text{writes}$$

as required by ordinary tree sum addition.
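The odd/even halving scheme can be followed on whole numbers before worrying about bit slices. The sketch below is a simplified illustration: it performs each level as one vector addition, whereas the procedure in the text overlaps the levels bit-slice by bit-slice. The helper names `od`, `en` and `associative_sum` are assumptions, n is taken to be a power of two, and elements are counted from 1 when deciding which are odd and which are even.

```python
def od(x):
    """Odd elements of the vector x (1st, 3rd, 5th, ...)."""
    return x[0::2]


def en(x):
    """Even elements of the vector x (2nd, 4th, 6th, ...)."""
    return x[1::2]


def associative_sum(numbers):
    """Sum n numbers by repeatedly adding two halves and re-splitting the sums."""
    n = len(numbers)
    assert n >= 2 and n & (n - 1) == 0, "sketch assumes n is a power of two"
    a, b = numbers[: n // 2], numbers[n // 2:]
    while True:
        s = [x + y for x, y in zip(a, b)]   # one word-parallel addition
        if len(s) == 1:
            return s[0]
        a, b = od(s), en(s)                 # route the sums for the next level


print(associative_sum([1, 2, 3, 4, 5, 6, 7, 8]))
```

For eight inputs the vector of sums shrinks from four elements to two to one, so three additions suffice, matching the least integer not less than the base-2 logarithm of n.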


Fig. 1. Lay-out of operands in
memory
Multiplication of two numbers

$$Q = q_m q_{m-1} \cdots q_1 \quad \text{and} \quad R = r_m r_{m-1} \cdots r_1$$

can be performed by the following summation

$$T = \sum_{i=1}^{m} \left[\, 2^{i-1} q_i\,(r_i r_{i-1} \cdots r_1) + 2^{i-1} r_i\,(q_{i-1} \cdots q_1) \right] \qquad (5)$$

Let

$$Q_i = q_i\,(r_i r_{i-1} \cdots r_1)$$

$$R_i = r_i\,(q_{i-1} \cdots q_1)$$

$$P_i = \sum_{j=1}^{i} 2^{j-1}\,(Q_j + R_j)$$

Then Eq. (5) can be rewritten as $T = P_m$.

But

$$P_i = 2^{i-1}(Q_i + R_i) + P_{i-1} = G(Q_i, R_i, P_{i-1}) \qquad (6)$$

and $T_i$, the ith bit of T, can be obtained at the
ith iteration of Eq. (6) for i from 1 up to m.
In fact, $T_i$ is the ith bit of $P_i$. So we may write

$$T_i = G(Q_i, R_i, P_{i-1}) \qquad (7)$$
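The recurrence can be checked numerically. The sketch below assumes one particular reading of the partial terms, namely that the i-th term pairs bit i of the first operand with the low i bits of the second, and bit i of the second operand with the low i-1 bits of the first; the function names are illustrative. Under that reading the running value after i iterations equals the product of the low i bits of the two operands, so its i-th bit already agrees with the i-th bit of the full product.

```python
def low_bits(x, i):
    """The i least significant bits of x."""
    return x & ((1 << i) - 1)


def product_by_recurrence(q, r, m):
    """Iterate P_i = 2^(i-1) * (Q_i + R_i) + P_(i-1) for i = 1..m."""
    p = 0
    history = []
    for i in range(1, m + 1):
        q_i = (q >> (i - 1)) & 1
        r_i = (r >> (i - 1)) & 1
        term_q = q_i * low_bits(r, i)       # bit i of q times the low i bits of r
        term_r = r_i * low_bits(q, i - 1)   # bit i of r times the low i-1 bits of q
        p = (term_q + term_r) * (1 << (i - 1)) + p
        history.append(p)
    return p, history


m = 4
for q in range(1 << m):
    for r in range(1 << m):
        p, history = product_by_recurrence(q, r, m)
        assert p == q * r
        for i in range(1, m + 1):
            # bit i of P_i is already bit i of the final product
            assert (history[i - 1] >> (i - 1)) & 1 == ((q * r) >> (i - 1)) & 1
print("recurrence verified for all", (1 << m) ** 2, "operand pairs")
```

Because each result bit is final as soon as its iteration completes, the bits can be handed to the next stage one at a time, which is the property that lets successive multiplications overlap in the same way as successive additions.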


The similarity of Eqs. (6) and (7) to that of
Eq. (3) allows us to perform successive multipli-
cation the same way as performing successive
addition. For the product of n numbers each of length
m, the total number of steps required will be
meme logon. 
INTRODUCTION
CHoPP 


This paper describes certain aspects of software support for CHoPP 
(Columbia Homogeneous Parallel Processor), a large scale MIMD ma- 
chine which has been under study by our group for almost two years 
[1,2,3,4]. CHoPP is intended to speed up digital computing by use of 
parallelism; its applications are not restricted to any special class of 
algorithms, nor limited to numerical analysis, or to non-numeric compu- 
tation. The goal of speed up is common to most proposals for parallel 
computers. However the idea of a single machine capable of achieving 
such speed up for practically all of the mainstream algorithms of digital 
computing is not usually to be found in previous and contemporaneous 
proposals and implementations. CHoPP’s approach consists of support- 
ing vast amounts of parallelism at much lower levels of hardware and 
software than has heretofore been attempted in MIMD machines. At 
this low level, parallelism is to be found in virtually every program, and 
this parallelism may be exploited to speed up computation. 


There exists a large and growing literature on the complexity of parallel
algorithms [5,6], which shows that significant parallel speed up may be
obtained for the great majority of algorithms of computer science. In 
some cases, the speed up available is proportional to the number of 
processors employed. These theoretical results have been confirmed [7] 
and extended [8] by analysis of typical FORTRAN programs, and by 
measurements, using trace techniques, of the run-time parallelism avail- 
able in such typical programs. The speed up calculations in the previous 
work do not usually take into account any of the ‘overhead’ functions
which contribute to the execution time in a practical multiprocessor. 
Specifically, no allowance is made for time required to assign a processor 
to a task, or for time required to communicate results from one task to 
another. Moreover, it is assumed that tasks are scheduled, in accordance 
with the constraints of the parallel algorithms presented, without any 
delay. The parallelism* identified in this work is characterized by very
short tasks, which execute for a few instructions, and then terminate, 
transmitting results to some other task. Implicit in these analyses is a 
model of a parallel processor as an MIMD machine in which negligible 
time is required for assignment and reassignment of processors (task 
switching) and negligible time is required for communication of results. 


*This form of parallelism is often called a ‘tightly coupled pro-
cess’. However, the term has been used in various ways; for example,
C.MMP has been termed [9] a tightly coupled processor, as compared to
the ARPA network of processors. The sense in which we would like to
use the term is very different from such usage. We have therefore avoid-
ed “tightly coupled” perhaps at the expense of some circumlocution.
Most of the previous designs [10,11,12] consist of an assemblage of
conventional computers, minicomputers, or microcomputers intercon- 
nected to form some kind of cooperating system. Although a variety of 
interconnection schemes and memory structures have been tried [14], 
and sophisticated operating systems devised [15], the existing and pro- 
posed systems fall far short of supporting the kind of parallelism des- 
cribed in the last paragraph. 


As a consequence, a central theme of multiprocessor research has been 
[12,13] the effort to discover parallel programs whose peculiar proper- 
ties permit them to be executed with low task interaction, and in- 
frequent task switching. These endeavors lead away from general pur- 
pose computing, to the identification of special applications. This search 
has thus far produced no large or interesting collection of suitable 
algorithms. Some special algorithms have indeed been uncovered [22], 
but it has yet to be shown that these have any bearing on the mainstream 
problems of digital computing. 


CHoPP is intended to support the form of parallelism generally available 
in computer programs. As such, it is designed to provide a good approx- 
imation to the idealized model of parallel multiprocessing described 
above. This necessitates (among other things) that task switching and 
intertask communication time both be reduced by several orders of 
magnitude, as compared with existing practice. To approach this level of
performance, we have found it necessary to revise current notions of 
almost every aspect of multiprocessor architecture, in the processor 
design, in the memory to processor interconnection network, and in the 
software support structure. The resulting architecture bears little rela- 
tionship to that of conventional multiprocessor designs. Some of its 
structures are more reminiscent of high speed sequential machines and 
vector processors, in that they rely on extensive pipelining and high 
bandwidth interleaved memory. 


The Virtual Hardware Machine 


The hardware implementation of CHoPP has been described elsewhere 
[1,3,16]; for the purpose of this paper, it is sufficient to explain the 
virtual hardware machine which this implementation provides for use by 
the system programmer. This machine runs the operating system and 
executes the language support software. The facilities available on this 
machine determine the ease with which system programs may be
written, and together with the timing of operations, determine the 
overall speed of execution of system programs. The specification of the 
virtual hardware machine includes all facilities and timing available to the 
system programmer, but excludes structures which are implemented in 
firmware and hardware. Thus the virtual hardware machine is just that
machine which is usually described in a computer user's manual, but not
the one described in the maintenance manual. 


Figure 1 shows a block diagram of CHoPP at the virtual hardware level. 
There are three main elements: a number (N) of identical processors, a 
shared memory, and an interrupt system which permits one kind of 
interprocessor communication. The processors are general purpose 
computers. In preliminary designs a word length of 32 bits appears 
appropriate; however, such details are beyond the scope of this paper. 
The memory for these computers is a single, shared entity, organized into 
a single address space, and accessible, in toto, by every processor on an
equal basis. Using conventional semiconductor technology, the speed of 
each processor will be approximately 200,000 instructions per second, 
and it is realistic to implement a system with approximately 1000
processors; such a machine would operate at 200 MIPS. There is nothing
in the hardware design, nor, as we shall see, in the software structure, to 
limit the number of processors implemented. 


[Fig. 1. Virtual Hardware Machine (interrupt system, N processors, shared memory)]


The key element in the system shown in Fig. 1 is the common shared
memory. All processors access the memory independently and concur- 
rently. Conflicting references are resolved by the underlying hardware 
without loss of efficiency and effectively without any loss of time. 
Consequently there is no need to provide copies even when code or data 
is shared by a large number of processors. The most important aspect of 
this shared memory is that communication of code and data from one 
processor to the other is accomplished by passing an address pointer. 
This can be done in one instruction time, independent of the length of 
the material being transmitted. The common, shared, conflict free 
memory is thus the mechanism by which the intertask communication 
requirements (described in the previous section) are met. Ultimately, the 
shared memory is also the basis for achieving the rapid context switching 
required for the support of parallel processes in the general purpose en- 
vironment. Whenever a new task is to be initiated, it can be sent to any 
desired processor by the execution of a single instruction, which contains
an appropriate pointer to a block of shared memory. Of course, addi- 
tional mechanisms are required to schedule and synchronize parallel 
tasks. But for all of these, a key factor is intertask communication, 
which makes possible the transmission of arbitrarily long messages, with 
negligible overhead. The implementation of the memory processor
structures which accomplish the performance just described is the princi-
pal hardware innovation of CHoPP. This implementation is efficient and
economical for practically unlimited numbers of processors. 


The Problem of Control and Organization 


The high performance hardware structure of CHoPP does not, by itself, 
provide any assurance of high system performance. Efficient techniques 
for organization and control of parallel tasks are required. The problem 
may be stated thus: some mechanism is needed which assures that,
whenever tasks to be run are waiting, they will be assigned to available 
processors without delay. In conventional multiprocessors, this organi- 
zation is provided by the operating system; in CHoPP it is the function 
of analogous software called the node kernel, whose operation is sup-
ported by hardware primitives. The importance of this kind of software
has been pointed out [14] by Enslow. In the conventional multipro- 
cessor, it might be hoped that system performance improves materially 
with the addition of a processor, provided the number of processors is 
sufficiently small. This is not the case; Enslow reports an example where 
throughput increased by a factor of 1.8 for a two processor system, and 
by a factor of only 2.1 for a three processor system. He attributes this 
non-linearity, in part, to the operating system. 


Amdahl, as quoted [17] in Computer World, goes further: “The need
for control and coordination software ... [results] in a situation where
by the time you got to four processors you actually had less performance
than with three.”


The problem of control of parallel processes has two sides. From the 
standpoint of the user, there are instructions which invoke parallelism 
and provide for intertask communication. From the standpoint of the 
system, there are structures which interpret the user instructions and
carry out the intended parallel execution. The relative roles of the user 
language and of the system support differ in various approaches to
parallelism. In CHoPP, the node kernel does actual assignment of pro- 
cessors, and (with extensive hardware support) mediates communication 
between tasks. These activities are performed in response to the user 
program, which contains instructions invoking parallel processing. To 
understand the node kernel, it is first necessary to briefly consider the 
form of language constructs. 


In CHoPP the application program contains the calls for parallel execu- 
tion of code. When the user determines that some sequence of instruc- 
tions can be executed concurrently with the rest of the program, he
identifies this sequence as a task. When the program is run, the node 
kernel will schedule the task on one of the processors. The user view 
of CHoPP is thus similar to that of some multitasking systems (for exam- 
ple UNIX) which permit user invocation of parallel processes, and to 
that presented by operating system languages such as Concurrent Pascal. 


In order to control the sequence of tasks, and insure synchronization, the 
user must indicate, where appropriate, that a task should wait until 
results generated by another task are ready. The user need not concern 
himself with the time at which the result will be ready. Scheduling of 
processors is the responsibility of the system, and if a task is suspended, 
waiting for results, the user can expect that the processor on which it 
has been running will be reassigned to another runnable task until the 
results are ready. 


We may think of the user as programming a virtual user machine. The 
code for this user machine is the high level language implemented at the 
installation. For definiteness, we may think of an ALGOL-like language. 
The user machine has an unlimited number of processors. Whenever a 
new task is called, a processor starts its execution; when a task stops, the 
processor goes back into the infinite reservoir of processors. In the high
level language which represents the user machine, the only constructs
required to support parallelism are CALL P TASK, and a pair of
synchronizing constructs such as WAIT result and SIGNAL result. Here result is
the unique name for the data which has been generated by the task.
When a task needs a result, the programmer uses a WAIT statement in the
code of that task, at the point where the result is required. This will
cause the task to suspend until the result is ready. Whenever a task
generates a result needed by another task, the programmer uses SIGNAL
result, which transmits the pointer result and restarts the waiting task. The
CALL P TASK statement (which we have borrowed from PL/I) simply
indicates that the code P should be run in parallel with the rest of the
program. As seen by the user, this causes a processor to start executing
P immediately.
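
The three constructs of the user machine can be imitated on a conventional system. The following is a minimal sketch only, assuming Python threads stand in for the unlimited processors of the user machine; the names Result, call_task, producer and consumer are illustrative and do not come from CHoPP.

```python
import threading

class Result:
    """A named result: SIGNAL stores a value, WAIT blocks until it is there."""

    def __init__(self):
        self._ready = threading.Event()
        self._value = None

    def signal(self, value):
        # SIGNAL result: publish the value and restart any waiting task.
        self._value = value
        self._ready.set()

    def wait(self):
        # WAIT result: suspend the calling task until the result is ready.
        self._ready.wait()
        return self._value

def call_task(procedure, *args):
    # CALL P TASK: run `procedure` in parallel with the caller.
    task = threading.Thread(target=procedure, args=args)
    task.start()
    return task

def producer(result):
    result.signal(sum(range(10)))

def consumer(result, out):
    out.append(result.wait() * 2)

def run_example():
    result, out = Result(), []
    # The consumer is called first; it simply waits until the producer signals.
    tasks = [call_task(consumer, result, out), call_task(producer, result)]
    for task in tasks:
        task.join()
    return out[0]
```

Note that the consumer never asks when the result will be ready; as in the user machine, the order of the two calls does not matter.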


In the user machine, the constructs CALL TASK, WAIT and SIGNAL
cause tasks to run and to stop at various times. Between executions of
these statements, the tasks run completely asynchronously.
However, no task starts until it is called, and every task waits for the 
specific results (or synchronizing signal) which it requires. Thus the ex- 
ecution of tasks is rigidly controlled by the statements in the language 
which support parallelism. Taken together, these statements, as they 
appear in any particular program, constitute a schedule for segments of 
executable code. Another way of expressing the same idea is that the 
parallelism statements in any program construct a precedence graph, 
which governs the order of execution of code segments. This schedule,
or precedence graph, arises naturally from the algorithm, and provides
the (partial) order for the execution of tasks in CHoPP.


In CHoPP, the operations of WAIT and SIGNAL, as well as some other 
synchronizing monitors, are implemented in the hardware and firmware 
of the processors. The execution time of these primitives then becomes
that of a typical single instruction, and need not concern us here. How- 
ever, since we are interested in implementing languages which support 
recursion, and in systems which incorporate virtual memory, the trans- 
lation of symbolic addresses, such as result must be accomplished at 
run time. This translation, which is a function of the operating system 
and language support software in sequential machines, is a function of 
the node kernel in CHoPP, and will be described further on. 


In normal programming for CHoPP, many more tasks are invoked than 
there are real processors to execute them. The user therefore knows 
that in the real machine the execution of a task or task activation will 
normally be deferred for some time, until a processor is available to 
execute it. But for CHoPP he can expect that 1) All but a negligible 
number of processors will be busy, whenever tasks to execute are avail- 
able 2) The overhead associated with a task call will be about the same 
as that of a subroutine call in a sequential machine 3) No time will be 
lost in moving programs, data or results from one processor to another. 


We shall describe the three functions of the node kernel which make 
possible this kind of performance. The first of these is processor alloca- 
tion. When a task running in one of the processors executes a CALL 
TASK statement, a processor must be assigned to the new task. Because 
many tasks are running in parallel, many CALL TASK statements may be 
executed concurrently. The key to efficient execution is that the activi- 
ties required to generate the new tasks and assign the processors be car- 
ried out fully in parallel, without central programs or central tables. 
The second function is that of memory management. New tasks will, 
in general, require memory space in which to run. This space must be 
assigned from a general pool. When space is released, it must be returned 
to the pool. But, as before, the management of memory space must be 
carried out fully in parallel. Otherwise, creation and deletion of tasks 
would depend on a sequential mechanism. Such a mechanism will have, of
course, some limited capability, and when many tasks request its service,
the resulting bottleneck may stop the whole machine. The third
function is that of reference resolution. We have described the
transmission of information between tasks, using synchronizing constructs
(e.g. WAIT and SIGNAL) which are hardware supported. The data or
results which are transmitted by these statements are referenced
symbolically by the programmer. In simple languages, like FORTRAN, the
symbolic references are translated to machine addresses at compile time.
But in the general case, which the CHoPP operating system supports,
this is not possible. Tasks are created at run time, often as a result of 
computation. Moreover, the language which CHoPP supports will per- 
mit recursion, as the most natural and efficient way of creating new 
tasks. Under these circumstances (as is well known) the symbolic ref- 
erences in the program must be translated at run time, into machine ad- 
dresses. Again, this activity must be performed in parallel, by many 
processors, without reference to central tables. 


THE NODE KERNEL 


The node kernel is software that performs the operating system
functions, such as allocating tasks to processors, allocating memory to tasks,
and mediating intertask communication. In a parallel computer, these 
functions cannot be accomplished by a single processor as this would 
require all other processors to queue up to obtain services, leaving them 
idle almost all the time. Each operating system function must be distri- 
buted among all the processors, so each processor must have its own 
kernel. In CHoPP, the kernels that run at each processor are identical.
Each kernel behaves as an autonomous program and no processor kernel 
assumes a master or control role. All node kernels may be run in parallel 
and there is no hierarchical structural relationship between them. We 
will show how some of the operating system functions can be carried 
out by a large number of node kernels running in parallel. 


Task Allocation 


The first function that we will consider is the allocation of tasks to pro- 
cessors. In CHoPP we run any task at any node since the applications 
programmer cannot know ahead of time what processors will be avail- 
able. Further, in a machine designed to use programming languages 
that support recursion, the number of tasks to be performed is, in 
general, data dependent, hence the applications programmer does not 
know the number of tasks to be assigned, let alone to which processor
to assign them. This function must be carried out by the node kernels 
at run time. A task may be created by either of two methods: as a job
that has been entered into the system at a node, or spawned by an
existing task already at a node. In either case the node where the task
is created is not necessarily the node where the task will eventually be
run. 


In sequential machines that support multi-tasking, the operating system 
determines the assignment of processors to tasks in the following way 
(this description closely follows Denning [18]). A task manager, part
of the operating system, manipulates two queues and a task list. The
task list is merely the collection of all activation records (called
"statewords" by Denning). The activation record contains the task's unique
index number, the location in memory associated with the task, the in- 
struction pointer, and registers. It thus consists of all information ne- 
cessary to initiate the execution of a task. The two queues maintained 
by the task manager are the queue of ready tasks and the queue of tasks 
waiting for some event to occur. Entries in each queue consist of point- 
ers to appropriate activation records. Whenever the running task ter- 
minates, a new task is taken from the ready queue by the task manager, 
which initiates its running. When a running task is blocked, waiting for 
an event to occur, the task manager removes it from the processor, 
places it in the queue of tasks that are blocked, and initiates execution of 
a task from its ready queue. When an event takes place which unblocks 
a blocked task, the task manager puts this task in the ready queue. 
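
The behavior of such a task manager may be sketched as follows. This is an illustration under stated assumptions, not Denning's or CHoPP's implementation: the class and method names are ours, and the blocked queue is kept as one queue per event name purely for convenience.

```python
from collections import deque

class TaskManager:
    """Single-processor task manager with a task list and two queues."""

    def __init__(self):
        self.task_list = {}   # task index -> activation record
        self.ready = deque()  # indices of runnable tasks
        self.blocked = {}     # event name -> indices waiting on it (assumed layout)
        self.running = None   # index of the task holding the processor

    def create(self, index, record):
        self.task_list[index] = record
        self.ready.append(index)

    def dispatch(self):
        """Give the processor to the next ready task, if there is one."""
        self.running = self.ready.popleft() if self.ready else None
        return self.running

    def terminate(self):
        """The running task ends; its record is dropped and a new task starts."""
        self.task_list.pop(self.running, None)
        return self.dispatch()

    def block(self, event):
        """The running task waits for `event`; another ready task takes over."""
        self.blocked.setdefault(event, deque()).append(self.running)
        return self.dispatch()

    def unblock(self, event):
        """`event` occurred: every task waiting on it becomes ready again."""
        self.ready.extend(self.blocked.pop(event, ()))
```

Only task indices move between the queues; the activation records themselves stay in the task list.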


This is a description of a multitask, single processor system. Each node
of CHoPP will be run in this manner. When a CALL TASK statement is
executed by a program running at a node, a trap to the node kernel is
generated in the hardware. The response of the node kernel is to con- 
struct an activation record and place the pointer to this activation record 
in its ready queue. For high utilization of CHoPP, a mechanism is 
needed to assure that no processor is ever idle. The only circumstance 
under which a processor may be idle is if its ready queue is empty when 
the task it is running terminates or suspends. To eliminate the possibility 
of any node having an empty ready queue while other nodes have ready 
queues with more than one task waiting, the node kernel must equalize 
the length of its ready queue with the ready queues of the other nodes 
in the machine. We call this mechanism queue balancing, which is ac- 
complished as follows: 
1. When a processor creates a task, it selects at random another
processor and assigns this task to it. The receiving node kernel places the
task in its ready queue.


2. Whenever the number of runnable tasks in the queue at a node 
either increases as a result of step 1 above, or decreases as a result of a 
task leaving the queue to be run, the node kernel selects another node 
at random to perform balancing. That is, enough runnable tasks are 
sent from the node with the longer queue to the other node to equalize
their lengths.
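
The two steps above can be sketched as a small model. This is a simplified illustration, not the CHoPP node kernel: each ready queue is represented only by its length, and the function names (pick_other, balance, create_task, start_task) are ours.

```python
import random

def pick_other(n, node, rng):
    """Pick a node index uniformly at random, excluding `node`."""
    other = rng.randrange(n - 1)
    return other if other < node else other + 1

def balance(queues, a, b):
    """Equalize the ready-queue lengths of nodes a and b (to within one task)."""
    total = queues[a] + queues[b]
    queues[a] = total // 2
    queues[b] = total - queues[a]

def create_task(queues, creator, rng):
    """Step 1: the creator hands the new task to a randomly chosen other node.
    Step 2: the receiver's queue grew, so it balances with a random node."""
    n = len(queues)
    receiver = pick_other(n, creator, rng)
    queues[receiver] += 1
    balance(queues, receiver, pick_other(n, receiver, rng))

def start_task(queues, node, rng):
    """A task leaves the ready queue to run; step 2 then balances again."""
    if queues[node] == 0:
        return False  # nothing runnable here: this processor would go idle
    queues[node] -= 1
    balance(queues, node, pick_other(len(queues), node, rng))
    return True
```

Balancing never creates or destroys tasks; it only moves them between the two chosen queues, so the total number of runnable tasks is unchanged by it.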


The performance of the algorithm just described is measured by the ex- 
pected number of processors whose ready queues are empty. These 
processors may not be idle since they may still be running a task; there- 
fore this measurement provides a lower bound on the efficiency of 
the algorithm. A computer simulation was carried out to determine, 
for a 256-node CHoPP, the expected number of empty ready queues.
The results of this simulation are shown in Fig. 2, where the expected
fraction of empty ready queues is plotted against mean queue length.
The upper curve shows application of
step 1 only of the algorithm. The lower curves show the application of
step 2. These results indicate that the algorithm is entirely adequate 
whenever the number of tasks exceeds the number of processors by a 
reasonable factor. 


As CHoPP is presently conceived, no further improvement on the algo- 
rithm is necessary, from the standpoint of efficiency. For the record, 
we observe that the following modification (which we were considering 
before the results of Fig. 2 were available) will further decrease the num- 
ber of idle processors. Assume an otherwise idle node will periodically 
select another node at random and balance its queue with it. The other- 
wise idle node will continue this procedure until it obtains one or more 
tasks to run. This will have the effect of further reducing the mean 
queue length for 100% utilization. 


It should be pointed out that the activation record itself is kept in the
memory. When a runnable task is passed from one node kernel's queue
to another, only a pointer to the activation record is passed, not the
record itself.


Memory Management 


The conventional structure for memory management in system design 
has been to maintain a central list of page frames for available storage. 
It is not feasible in CHoPP (where memory is managed by the kernel in
each processor) to have multiple memory managers updating a central
table simultaneously, because sequential access produces unacceptable
bottlenecks. Just as in the case of task assignment, it is necessary to
devise techniques to distribute not only the management of the list
but the list itself equally over all nodes.


When a task is created, it will require memory space assignment. This 
will be controlled by the memory manager at the node kernel where the 
execution of the task begins. Recall that memory in the CHoPP system 
is a single address space memory equally accessible by every node. All
nodes can concurrently access every word of memory. The CHoPP
memory is organized in page frames. The size of the page frame will be
considerably smaller than that for an IBM system. The reason for this 
is that the expected segment size is small, perhaps 32 words, as opposed 
to 16K in IBM systems. This kind of segment size has been experienced 
in Burroughs Machines and Multics [19]. For a discussion of the relation 
between segment size and page frame size, see Brinch Hansen [20]. 


Pointers to page frames are transferred from processor to processor 
during memory assignment. Initially, all nodes have the same number of 
page frame pointers. When a task is initiated, the memory it needs will 
be obtained from the list of available page frames at the initiating node. 
It is desirable to maintain equal numbers of page frames at every node as 
long as memory is available in the machine. If any node discovers it
has too little memory to meet the demands of its running tasks, then it
must invoke some overflow protection mechanism or virtual memory.
However, these mechanisms should not be triggered unless there is, in 
fact, a global lack of memory. 


Hence, the two functions of the memory manager are the maintenance 
of even length lists of available page frames, and the detection of global 
memory overflow. The principle on which the memory manager
controls its list of page frames is analogous to the methods used for queue
balancing by the task manager. When a task releases page frames, these
are added to the local list of available page frames at the node. When a
task requests page frames, the request is filled, if possible, from the 
local node’s list of available page frames. If insufficient page frames 
are available to satisfy any request, the memory manager selects another 
node at random, and acquires page frames, repeating this process as 
often as necessary to fill the requirement. 

In order to achieve the goal of maintaining about the same number of 
available page frames at each node, a balancing operation is used. The 
memory manager in each node maintains a quantity, the page frame 
estimator (PFE), which, at all times, is an estimate of the average number
of page frames in all nodes. Whenever the number of available page
frames at a node changes, for any reason, the memory manager compares
the number of available page frames with the PFE. If the difference
between these two quantities exceeds a preset initiation limit (which 
may depend on the PFE), a balancing operation takes place. This bal- 
ancing operation consists of selecting, at random, another node, and 
equalizing the number of page frames in the two nodes. The balancing
operation is repeated until the difference between the number of pages in
the node and the PFE is within a preset acceptance limit (which may
also depend on the PFE). Note that there are two preset values involved,
namely the initiation limit, which determines the point at which the
balancing operation is initiated, and the acceptance limit, which causes the
balancing operation to stop. By adjusting the relative size of these limits,
the performance of the system may be modified.


Each time that a memory manager acquires page frames from another
node, or balances with it, the length of the list of available page frames in
the other node is noted. The list lengths thus accumulated constitute
a random sample of the average list length in the whole machine.
Therefore, the average of these measurements constitutes a statistic which is
an estimator of this average list length. The average of the measurements
of the length of lists in other nodes is the PFE. (Of course, each memory
manager maintains its own PFE.) By adjusting the initiation limit and
the acceptance limit, the designer can assure that the lists throughout 
the machine are equalized, to within any previously specified tolerance. 
The mechanism just described for determining the PFE is a sequential 
sampling technique for estimating the mean of a distribution, and as such 
it requires a minimum number of balancing operations to achieve a de- 
sired tolerance, with a fixed (predetermined) confidence limit. 
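
The balancing rule and the PFE can be sketched together. The following is a hedged illustration under several assumptions of ours: the PFE starts at the node's initial frame count (all nodes start equal), only the initiating node records a sample, and max_rounds is a safety guard that is not part of the description above. The names MemoryManager, balance_with and rebalance are hypothetical.

```python
import random

class MemoryManager:
    """Per-node list of available page frames, represented by its length."""

    def __init__(self, free_frames, initiation_limit, acceptance_limit):
        self.free = free_frames
        self.initial = free_frames  # assumed PFE before any sample exists
        self.samples = []           # list lengths noted at other nodes
        self.initiation_limit = initiation_limit
        self.acceptance_limit = acceptance_limit

    def pfe(self):
        """Page frame estimator: mean of the list lengths seen elsewhere."""
        if not self.samples:
            return self.initial
        return sum(self.samples) / len(self.samples)

    def balance_with(self, other):
        """Note the other node's list length, then equalize the two lists."""
        self.samples.append(other.free)
        total = self.free + other.free
        self.free = total // 2
        other.free = total - self.free

    def rebalance(self, nodes, rng, max_rounds=50):
        """Balance with random nodes once the initiation limit is exceeded,
        stopping when within the acceptance limit. Returns the rounds used."""
        if abs(self.free - self.pfe()) <= self.initiation_limit:
            return 0
        others = [n for n in nodes if n is not self]
        rounds = 0
        while (abs(self.free - self.pfe()) > self.acceptance_limit
               and rounds < max_rounds):
            self.balance_with(rng.choice(others))
            rounds += 1
        return rounds
```

A node that has just had many frames released to it would call rebalance; a node whose count is already close to its PFE does nothing, which is the purpose of the initiation limit.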


In any case, the local PFE provides at each node a reliable estimator 
for the amount of available memory in every other node, and therefore 
in the whole machine. Note that this estimate has been derived without 
any central structures which might produce a bottleneck, and essentially 
as a by-product of the basic memory allocation structure. When the total
amount of memory remaining in the machine has decreased below some
preset level, the mechanism described above will produce an excessive
number of search operations whenever a node requires memory. At this
level, the memory is effectively full. When the PFE drops below
this predetermined lower limit, at any node, the overflow mechanisms
of the machine should be invoked. In this way the second function of
the memory manager has been implemented. 


Reference Resolution: Distribution of Central Tables 


As we have seen previously, the central tables required by an operating
system must be handled differently in CHoPP. The two techniques
we have seen for task and memory allocation are not, however,
sufficient for all such tables. We now give a third technique which is useful
for a variety of purposes. To describe this structure and show how it is 
used, let us use the example of the table containing buffer information 
used for intertask communication. In a sequential processor, when a 
task needs information about a buffer, it sends its request to the operat- 
ing system, which searches for the buffer name in a central table. Upon 
locating the name, the operating system retrieves the requested
information and sends it to the task.


In CHoPP, the buffer name for each buffer is hashed. The resulting hash
is interpreted as the address of a node. Each node of CHoPP will main- 
tain the portion of the table that is hashed to it. Thus, instead of one 
large central table, CHoPP will have one small table at each node and 
each buffer name will occur in exactly one of these small tables. When 
a task needs information on a given buffer, it sends its request to its 
local node kernel. This node kernel hashes the buffer name to find the 
address of the node where the requested information will be found. A 
message is sent to the kernel at the node to obtain the information. 
When that message is received, the information is obtained and sent back
to the task's local node kernel, which then relays it to the task.


Any table can be distributed among the nodes of CHoPP by hashing the 
search key. The number of accesses to any table at any time is never 
greater than the number of nodes. Hence, if the hash gives a reasonably 
uniform distribution of node addresses from a table’s search key, then 
the queue of requests at any node will never be unacceptably long. 
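
The distribution of a table by hashing its search key can be sketched as follows. This is an illustration only: the choice of CRC-32 as the hash, and the names Node, Machine, home, define and lookup, are assumptions of ours, and the messages between kernels are replaced by direct calls.

```python
import zlib

class Node:
    def __init__(self, address):
        self.address = address
        self.table = {}  # this node's share of the distributed table

class Machine:
    def __init__(self, num_nodes):
        self.nodes = [Node(a) for a in range(num_nodes)]

    def home(self, name):
        """Hash a search key to the address of the node that holds its entry."""
        return zlib.crc32(name.encode("utf-8")) % len(self.nodes)

    def define(self, name, info):
        """Enter `name` in the one small table it hashes to."""
        self.nodes[self.home(name)].table[name] = info

    def lookup(self, name):
        """The local kernel hashes the name and asks the owning kernel."""
        owner = self.nodes[self.home(name)]  # stands in for the request message
        return owner.table.get(name)         # stands in for the reply message
```

For example, after `m = Machine(16)` and `m.define("buffer_a", {"size": 32})`, a call to `m.lookup("buffer_a")` from any node returns the same entry, and the name appears in exactly one node's table.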


Discussion and Conclusions 


The hardware basis of CHoPP provides a new memory/processor rela- 
tionship in which large numbers of processors can concurrently access a 
large common high speed memory, without incurring penalties due to 
conflicting references. This permits an unprecedented level (for an 
MIMD machine) of intertask communication, and facilitates rapid con- 
text switching. 


In this paper, three important aspects of the CHoPP software support
have been discussed. The theme that runs through the operating system
design is decentralization of control functions and of tables. In each of
the examples presented, the tables and control functions have been
distributed in such a way that every node processor shares equally in
the operating system task. Moreover, all functions are performed
autonomously, yet cooperation is achieved. By analogy with the tessellated
automata of Von Neumann, we call this system "self-organizing".


Thus the focus of our research on operating system techniques is de- 
centralization of function. But, at the same time, it becomes clear that 
a sophisticated operating system (the node kernel) is required, and that 
it will bring with it a number of “overhead” functions. We would main- 
tain that these functions are an inevitable concomitant of the control 
of parallel tasks. Our goal is not to eliminate these functions, for we
believe that such an attempt would be futile, but rather to find the means
of providing them in such a way that they do not interfere with
processing. This is accomplished when there are no shared control mechanisms
whose limited capacity can cause resources to be idle while waiting for
some service function. We expect that, within tolerable limits, every 
processor in CHoPP will always be kept busy. We also expect that some 
large fraction of the time of each processor (perhaps 30-50%, as in some 
existing sequential machines) will be spent in overhead functions. 


This philosophy also conditions our CHoPP hardware design. In a 
parallel processor, hardware support (at the instruction level) of task
generation and synchronizing primitives is just as important as branch
instructions in a sequential machine.

The achievement of conflict free parallel access memory seems to require 
complex and elaborate interconnection circuits. All of these factors 
increase the cost of the machine, so that a CHoPP configuration of N
processors will cost more (we now estimate by about 50%) than N
separate processors of the same general capability. This is the price
paid for the speed-up in execution of algorithms in a general purpose
context.


This increased hardware complexity, dedicated to the support of parallel- 
ism, provides a benefit in reducing the running time of the operating 
system functions described in this paper. The principal factor in node
kernel overhead is the execution, at each node, of a conventional
multitasking system on behalf of the tasks managed by the local node kernel.
Primitives specifically oriented toward improving the efficiency of
execution of parallelism will make the node kernel functions, including 
multitasking, significantly more efficient. 


We end by a mention of the origins of some of the concepts in this 
paper. In one or another form, these concepts incorporate techniques 
for distributing central tables. The germ of this idea goes back at least 
as far as the original paper on C.mmp [13], where it is pointed out that
dividing global tables into smaller tables will improve the efficiency of 
parallel processors. The node kernel programs maintain tables, and 
assure synchronization by permitting only a single request to be active
at any time. Such programs are monitors as described by Hoare [21]
and Brinch Hansen [20]. 


It will be noted that some aspects of the node kernel which
were described in a previous paper differ from the present description,
which we consider an improvement.


HIGH LEVEL LANGUAGE CONSTRUCTS IN A SELF-ORGANIZING
PARALLEL PROCESSOR

T.R. Bashkow, D. Klappholz
Dept. of Electrical Engineering and Computer Science
Columbia University, New York, New York


SUMMARY 


CHoPP [1,2] is an architecture for MIMD parallel processors 
intended to support programs which consist of many short tasks. User 
written programs on hundreds to thousands of processors will typically 
run for less than one hundred instructions, and then be suspended, 
awaiting some results generated by another task or be deleted and 
replaced. Although the vast majority of parallel algorithms described in
the literature [3] operate in this manner, CHoPP is the first MIMD 
architecture oriented toward such a high degree of task switching and 
task interaction. 

The basic framework for an appropriate language is generally 
understood. Consider an ALGOL-like language with all the constructs 
necessary for sequential programming. To this we add CALL P TASK 
(borrowed from PL/I) which is used to initiate the parallel execution of 
a procedure P. Since tasks run asynchronously in CHoPP (as in any 
multitask machine), we need constructs for coordinating their activities. 
Suppose a task X needs a result generated by a task Y. Then, at the
point where the result is needed, the code for X will contain the
statement WAIT result. At the point just after the result has been generated,
the code for Y will contain the statement SIGNAL result. 

The WAIT statement causes X to suspend execution until Y 
executes the SIGNAL statement; thus the activities of X and Y are syn- 
chronized. These three constructs CALL TASK, WAIT, and SIGNAL, 
when used in conjunction with the control statements in the language, 
are adequate to express parallelism. (Of course, in a practical language, a 
richer vocabulary would be provided). Adequate rules for the control of 
access to variables are these: private variables may be used freely inside 
a task; shared variables must always be guarded by a semaphore, or, 
more generally, accessed through a monitor [4,5].

A program for creating large numbers of tasks may be written 
by placing CALL P(I) TASK inside a DO loop where I is the index of the
loop. This will cause a single processor to sequentially create the tasks.
In CHoPP, since task creation is a major activity, many processors must
concurrently create tasks; otherwise, many processors would be idle
much of the time waiting for a new task, and the purpose of parallelism 
would be vitiated. An algorithm in which a task creates two or more 
tasks, each of which in turn create two or more tasks, etc., will imple- 
ment the parallel creation of tasks, and create N tasks in order log N 
time. The natural program for this algorithm uses recursion. It consists 
simply of a procedure P which contains two or more CALL P TASK 
statements. Note that P is calling itself as a parallel task; of course, each 
successive activation of P will have different parameters. This algorithm 
generates a tree of tasks, which we call the ‘spawning tree’. Note the 
unexpected role of recursion in the parallel program. Instead of being an 
attractive but inefficient feature, as in sequential languages, it is the 
natural way of achieving efficient task generation. The convenience of 
recursion comes as an added benefit. 
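
The shape of the spawning tree can be illustrated with a short sketch. It is a sequential model only, assuming each activation of P splits its range of work in two; each recursive call stands in for a CALL P TASK statement, and the name spawn and its parameters are ours.

```python
def spawn(lo, hi, leaves, depth=0):
    """Model of P(lo, hi): split the range until one unit of work remains.

    Returns the depth of the deepest task in the spawning tree, which is the
    number of task-creation steps on the longest path from the root.
    """
    if hi - lo == 1:
        leaves.append(lo)  # this activation does the actual work
        return depth
    mid = (lo + hi) // 2
    left = spawn(lo, mid, leaves, depth + 1)   # CALL P(lo, mid) TASK
    right = spawn(mid, hi, leaves, depth + 1)  # CALL P(mid, hi) TASK
    return max(left, right)

def tree_depth(n):
    """Depth of the spawning tree that creates n leaf tasks."""
    leaves = []
    depth = spawn(0, n, leaves)
    assert len(leaves) == n
    return depth
```

With 1024 leaf tasks the tree is 10 levels deep, whereas the DO loop form would need 1024 sequential task creations by a single processor.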

Every program in CHoPP starts as a single task in one processor 
and spreads through the machine using the mechanism just described. 
This does away with structures found in many previous multiprocessors, 
namely, hardware for broadcasting tasks, a “host” machine which gener- 
ates the broadcasts, and the sequential scheduling programs that run in 
host machines. 

Run time creation of tasks is not only required for efficiency, as
just explained, but also for the execution of many kinds of algorithms in
which it is not known, prior to execution, which parallel tasks will be
needed or, often, even how many tasks will be generated. Transmission
of data from one task to another does not necessarily follow the
spawning tree; in other words, communication takes place between tasks which
are not necessarily related as parent/child by task generation. How is
this communication to be expressed in the parallel language?


The natural way to express communication between tasks is by
using a common name for each shared variable. This is consistent with
usual language conventions for communicating data between various 
parts of sequential programs. To support recursion in the sequential 
case, it is necessary to resolve name references, at run time, for each 
activation of a subroutine. This is implemented by consulting a stack. 
To support the more general type of communication between parallel 
task activations, we suggest that it is necessary to permit run time com- 
putation of names. When the same name for some variable is computed 
by two tasks, that variable value may then be transmitted between 
them. (Synchronization, as described above, must also be established.) 
The functions necessary to match computed names, and supply the re- 
quired pointers, are similar to those implemented in virtual memory 
systems for resolving page references. For CHoPP, a mechanization of 
such functions has been devised which employs no central tables or 
central control mechanisms [7].


SUMMARY


The CHoPP architecture [1] is a radical departure from previous 
MIMD machines, in that it is intended to support parallel tasks of ex- 
tremely short duration. CHoPP is therefore designed to switch tasks, 
reassign processors, etc. in a few microseconds. Functions which hereto- 
fore have been regarded as fundamentally global in nature, such as pro- 
cessor scheduling and memory management, are accomplished without 
central control mechanisms or central tables [2]. To support this kind
of activity the architecture of CHoPP employs many techniques
reminiscent of SIMD architectures. Like conventional modern large scale
computers, and like vector processors, it employs extensive pipelining
techniques and a very high bandwidth, shared, interleaved memory.

CHoPP hardware implementation embodies three basic
concepts:

1) CHoPP consists physically of N identical nodes, each of
which contains a processor, a memory bank, and a switch
element.

2) CHoPP nodes are connected by a high capacity multistage
network in the form of a binary k cube. Each node is
at the corner of this k cube and there must be 2^k = N
nodes. Each node therefore has a unique address k bits
long.

3) All memory banks are interleaved to form a single main
memory which can be accessed by any node. The maximum
distance from any processor to any memory bank is
k, so that for a 64K processor machine, for example, this maximum
distance is 16.
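
The relation between k-bit node addresses and distance in the binary k cube can be checked with a brief sketch. The function names are ours, and the routing order shown (correcting differing address bits from the lowest upward) is an assumption for illustration, not necessarily the rule used by the CHoPP network.

```python
def distance(a, b):
    """Number of cube edges between nodes a and b: the address bits that differ."""
    return bin(a ^ b).count("1")

def route(src, dst, k):
    """Nodes visited when differing address bits are corrected lowest first."""
    path = [src]
    current = src
    for bit in range(k):
        if ((current ^ dst) >> bit) & 1:
            current ^= 1 << bit  # move along the cube edge for this bit
            path.append(current)
    return path
```

For k = 16 the two most distant nodes are 0 and 2^16 - 1, whose addresses differ in all 16 bits, which agrees with the maximum distance of 16 quoted for a 64K processor machine.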

In this machine a memory access may be initiated by each 
processor every machine cycle. Memory accesses are accomplished in the 
following way. The processor assembles a packet giving its own address
(source), the desired memory address (destination node memory bank
plus displacement within the bank), and an operation (fetch instruction,
read data, write data, etc.). The message is then transmitted by relaying
from node-to-node until the memory is reached. The memory access is 
made and a return packet is sent to the requesting processor. Thus in the 
worst case there is a delay of 2k communication steps before the
initiating processor completes the memory reference operation. In addition,
there is the actual memory access time, plus possible queueing delays at
intermediate nodes. This total delay may be called the latency time.

In order to ameliorate this effect, each node processor is, in 
fact, a multi-tasking machine. It sends out packets not for a single task, 
but for as many tasks as it needs to, in order to do some instruction
processing at each machine cycle. We define a machine cycle as the time
required to:

• inspect any returning packet
• accomplish the operation (add, subtract, etc.) required
by this packet
• initiate a new packet.

How many tasks must it actively be running in order to keep
busy in this fashion? It must be running as many, on the average, as the
average latency time requires. One can think of this as a pipelining
technique in which the depth of the pipe is not fixed but which grows to
just exactly the size required to satisfy the requirement that each
processor is doing some instruction processing at each machine cycle.

The architecture, as just described, has the remarkable property
that the CHoPP machine executes a fixed number of instructions per unit
time regardless of this latency time, as will be shown below.


We will now discuss the parameters controlling this performance. 
Clearly, if I know the machine cycle time and the latency time 
(average) I know the number of tasks (average) which must be active. 
However, this assumes that: 

• there are always enough new tasks available to be initiated 
  as running tasks. 
• the network bandwidth and the memory bandwidth are 
  adequate to support this level of packet transmission. 

To see that the instruction execution time is fixed, consider the 
following simplified examples. Suppose that each task requires 100 
machine cycles. (Each instruction requires some small number of machine 
cycles.) After the packet for the first task is initiated, a packet for a 
second task is initiated, then a third, etc., until a response is received for 
the first task. We can then expect a steady state condition in which a 
packet is received and initiated on each machine cycle. Suppose the 
latency time is L microseconds, and the machine cycle is L/10 
microseconds. Then we will initiate L/(L/10) = 10 tasks which will run to 
completion in about 100 L microseconds. 

If we somehow reduce the latency time to L/2 microseconds, 
we will then initiate (L/2)/(L/10) = 5 tasks which will take 100 (L/2) = 50 
L microseconds to run to completion. Therefore we will initiate 5 more 
tasks which will complete in the next 50 L microseconds. Again we 
have run 10 tasks in 100 L microseconds. As indicated earlier, the speed 
of the machine is fixed regardless of the latency time. 
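
The arithmetic of these two examples can be checked with a short sketch. It is a minimal model in Python, assuming (as the examples do) that every machine cycle of a task waits one full latency period and that enough tasks are always available to fill the pipe. The function name and its parameters are illustrative only and are not part of the CHoPP design.

```python
from fractions import Fraction


def tasks_per_microsecond(latency_us, cycle_us, cycles_per_task=100):
    """Steady-state task completion rate of one node (hypothetical model)."""
    # Tasks that must be in flight so that one packet is issued every cycle.
    active_tasks = Fraction(latency_us) / Fraction(cycle_us)
    # Each of the task's machine cycles waits one full latency period.
    task_duration_us = cycles_per_task * Fraction(latency_us)
    return active_tasks / task_duration_us


L = Fraction(8)        # latency in microseconds; any positive value will do
cycle = L / 10         # machine cycle of L/10 microseconds, as in the text

rate_full = tasks_per_microsecond(L, cycle)        # 10 tasks per 100 L
rate_half = tasks_per_microsecond(L / 2, cycle)    # 5 tasks per 50 L
print(rate_full, rate_half)
```

Both rates reduce to 1/(100 x cycle time), so the latency cancels out of the result; only the number of tasks held in flight changes.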

To see what this means, we will make some estimates in order to 
see what performance can be expected from CHoPP. To continue with 
our simplified discussion we assume an old fashioned 2 memory cycles 
per instruction machine, thus 2 of the machine cycles are required for 
each instruction. With present day circuitry we can conservatively allow 
500 nanoseconds per machine cycle or 1 microsecond per instruction, 
giving us a 1 MIP machine per node. The latency time, L, consists of 2 
components, the actual memory cycle time Tm and the mean communi- 
cation time delay in the network Tn. 


Tm we can reasonably estimate as 500 nanoseconds. Tn is the 
product of the mean number of nodes traversed, which will be k/2 
going toward a memory plus k/2 returning for a total of k, and the mean 
waiting time at a node, W. Queueing analysis shows that W is of the order 
of one processor machine cycle. Thus we expect the mean latency, L, to 
be 500 nanoseconds x (1 + k) and the mean number of active tasks, Ni, to 
be: 

$$N_i = \frac{500\ \text{nanoseconds} \times (1 + k)}{500\ \text{nanoseconds}} = (1 + k)\ \text{tasks}$$

Table 1 shows these parameters for various sizes of machine. It 
can be seen that both latency and task queue length are reasonable even 
for the 64K machine because, of course, these values only increase logar- 
ithmically with N. Moreover, the introduction of a local cache mem- 
ory, which reduces memory accesses across the network, or in effect 
reduces L, can be used to reduce these numbers substantially. 


TABLE 1 

k                    4     7     10     13    16 
N = 2^k              16    128   1024   8K    64K 
L (microseconds)     2.5   4     5.5    7     8.5 
Ni = 1 + k           5     8     11     14    17 
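
The rows of Table 1 follow directly from the estimates in the text, and a few lines of Python reproduce them. This is only a sketch: the function name and default arguments are illustrative, and it assumes Tm = W = one machine cycle = 500 nanoseconds, as estimated above.

```python
def chopp_parameters(k, t_mem_ns=500, wait_ns=500, cycle_ns=500):
    """Return (nodes, latency in microseconds, active tasks) for a k-cube."""
    nodes = 2 ** k
    # L = Tm + Tn, where Tn = k hops (k/2 out, k/2 back) times the wait W.
    latency_ns = t_mem_ns + k * wait_ns
    # One task must be in flight for every machine cycle of latency.
    active_tasks = latency_ns / cycle_ns
    return nodes, latency_ns / 1000, active_tasks


for k in (4, 7, 10, 13, 16):
    print(k, *chopp_parameters(k))
```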
ABSTRACT 


This paper describes a multiprocessing 
system using conventional microprocessors which 
are dynamically restructured to get the desired 
word width and dynamically reconfigured to obtain 
the desired memory height. The system is space 
shared so that several tasks concurrently execute 
in different blocks of the partitioned resources, 
and communication is provided between and within 
task blocks. The dynamic reconfiguration and 
restructuring of the different processors and 
memory modules and interprocessor and I/O 
communications are achieved using an intercon- 
nection network called an SW-banyan, the cost 
of which is proportional to n Ln n, where n is 
the number of modules to be interconnected. A 
great deal of flexibility and power is available 
by means of an inexpensive SW-banyan switching 
network. The user has unprecedented capabilities 
to configure and structure the machine to fit his 
problem. When the problem allows it, very high 
memory word width and thus bandwidth is 
obtainable, so that memory is more effectively 
used; but when word width is unwanted, as in 
string manipulation, it can be efficiently 
limited. This architecture, therefore, seems very 
well suited to the "scientific" processing 
requirements of the future. 


1. Introduction 


The cost effectiveness of the microproces- 
sors opens the most exciting and challenging 
research area of using an assemblage of micro- 
processors to achieve the capabilities of large 
machines. Some rather well researched techniques 
for this include the reconfigurable multiproces- 
sor approach and SIMD or vector approach. The 
variable structure approach and the techniques 
for communications between the cooperating 
processors such as data flow techniques are also 
under investigation. These will be described in 
the following paragraphs. 


The reconfigurable architecture uses a 
cross-point switch to connect resources, like 
memory modules and I/O devices to the processors. 
These include the RW-400 [13] and the C-mmp [20]. 
We will use the term "configure" in this sense: 
to connect resources to the processor to use it 
more effectively. The SIMD or vector architec- 
ture uses several data operators or ALUs 
commanded by the same controller to execute the 
same operation on each data contained in each 
ALU. Early SIMD machines were associative 
memories because the simplest data operator is 
the comparator of an associative memory. More 
recently, the n-bit-sliced microprocessors like 
the Intel 3000 [8] and AM 2901 appear to make 
more cost-effective data operators. 


A very similar concept to SIMD, but not so 
well researched, is the varistructure concept. 
Such an architecture offers the prospect of 
dynamically coupling the n-bit-sliced micro- 
processors and their associated memory to get 
the desired word-width as well as structure the 
system for vector or array processing applications. 
We will use the term "structure" in this 
sense: to modify the apparent structure of the 
memory to utilize it more effectively. The 
first reference to the varistructure concept 
appears in a paper by Estrin [7] which suggests 
a fixed part of the processing unit and a 
variable part consisting of some registers and 
functional units. Estrin suggested that the 
variable part of the machine could be used to 
expand the word-width. However, he did not 
suggest any cost-effective way to do this. 


More recently other investigators have 
suggested varistructure architectures in terms 
of dynamically coupling the n-bit-sliced micro- 
processors in the system to fit the desired data 
structures and word-widths [3,9,12,26]. The first 
attempt to couple such microprocessors, by 
Lipovski [9], used a fast tree structure, which 
was difficult to schedule. In the architecture 
proposed by Okada, Tajima and Mori [12] the 
processing units are connected to their left and 
right neighbors. Each processor i has data 
paths to processors i+1, i+2, ..., i+2^n. However, 
this architecture puts the restriction that the 
processors coupled to make the desired word- 
width processing unit be physically adjacent to 
each other. Another varistructure architecture 
was proposed by Lipovski [10] in which proces- 
sors communicate with their left and right 
neighbors but the communication is actually done 
by a carry-look-ahead-tree-like structure to 
achieve speed and fail-soft capabilities. Again 
the processors coupled to work on bytes of a 
word must be physically adjacent to each other. 
These architectures are suitable for only a 
small number of processing elements, or for large 
numbers of elements if the operation is CPU 
bound, because their I/O capabilities significantly 
degrade as the number of processing 
elements in the array is made large. 


In any computing system where a number of 
independent processors concurrently and separately 
execute their own tasks (i.e. when the 
resources are space-shared) the possibility for 
them to cooperate depends on interprocessor 
communication. We will use the term "communicate" 
in this sense: to provide means between 
possibly independent processors to coordinate 
tasks and pass data between them. Generally, 
such "processors" can be extended to include 
input/output devices. Communication is fundamentally 
different from structuring or configuring. 
Structuring and configuring relate intimately to 
the fetch-execute cycle whereas communication 
relates indirectly to it (e.g. like an input/output 
operation). Either data or control or both are 
communicated. One of the early key papers, by 
Conway [5], showed how 'Fork' and 'Join' can 
specify the way independent processors can be 
utilized, when available, to solve a problem. 
Control communication is necessary to coordinate 
this type of operation. More recently, Dennis' 
data flow techniques [6] show how data implicitly 
carries control, so that data communication can 
be used to coordinate independent processors. 


The design of a flexible cost-effective 
interconnection network for configuring, 
structuring, and providing communication between 
processing modules forms the core of an efficient 
reconfigurable varistructure system of multiple 
microprocessors. We are encouraged by the 
discovery that all three functions can be 
obtained by the same inexpensive interconnection 
network. The interconnection network used in the 
proposed machine is called a banyan network [15] 
and has a cost function proportional to n Ln n, as 
compared to n^2 for a cross-point switch, where 
n is the number of processors and memory modules 
to be interconnected. The cost advantage of 
n Ln n is obtained at the expense of reduced 
total interconnection capacity. Preliminary 
measurements indicate that on the order of 10% of 
the possible interconnection structures are not 
connectable (i.e. they are blocked) [16]. Therefore 
these networks are blocking networks [4]. 
The objective of this paper is to show an 
architecture that effectively utilizes an SW- 
banyan for structuring, configuring and providing 
communication between processing modules. 
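
To give a feel for the two cost functions, the following Python sketch compares them for a few system sizes. The constants of proportionality are not given in the text, so both are assumed to be 1 here; only the growth of the ratio is meaningful, and the function names are illustrative.

```python
import math


def banyan_cost(n):
    """Cost proportional to n Ln n (unit constant assumed)."""
    return n * math.log(n)


def crosspoint_cost(n):
    """Cost proportional to n squared (unit constant assumed)."""
    return n ** 2


for n in (16, 256, 4096):
    ratio = crosspoint_cost(n) / banyan_cost(n)
    print(n, round(ratio, 1))
```

The ratio equals n / Ln n, so the advantage of the banyan widens as more modules are interconnected.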


The next section shows how the resources are 
interconnected in the SW-banyan. Its objective 
is to show how simple and modular the hardware 
is. The following section shows how the fetch- 
execute cycle and interprocessor communication 
function, and how simple the control of 
the processors and switch is. Software is briefly 
discussed next. The final section summarizes our 
results and points to further work. 

2. Static Structure of the Architecture 

The SW-banyan interconnection structure is 
defined and its control is explained in the first 
subsection. The internal organization of resources 
(processors, memory and I/O) is discussed next. 
Finally, interconnections to effect structuring, 
configuring and communication are delineated. 

2.1 Interconnection Structure 


The interconnection network used in this 
architecture belongs to the class of banyan 
networks which were proposed by Goke [16] for 
partitioning multiprocessor systems. For the 
sake of completeness some of the important 
properties of these networks, relevant to the 
proposed architecture are briefly reviewed here. 


A banyan can be defined as a particular 
type of directed graph. A base in the banyan is 
defined as a vertex having no arcs incident into 
it and an apex is any vertex having no arcs 
incident out from it and all the other vertices 
are called intermediates. The graphical 
representation of a banyan is a Hasse diagram of 
a partial ordering (i.e. an irreflexive, 
asymmetric, intransitive graph) such that there is 
one and only one path from any base to any apex. 
Larger banyans can be synthesized from the 
smaller banyan modules such that the resultant 
network holds this property. A regular banyan 
is one in which the number of arcs incident into 
each vertex and incident out from each vertex are 
constants called fan-out (F) and spread (S) 
respectively. Our proposed machine uses the SW- 
banyan [15] which can be obtained by recursively 
expanding a cross bar structure. It can also be 
obtained by replicating a tree. Draw an L-level 
tree with fan-out F and with root at the top. 
Then replicate the top (root-ward) branches S 
times. Then replicate all the structure above 
the next level of nodes S times. Continue 
replicating at each lower level of nodes until 
the bottom node is reached. Note that the 
resulting SW-banyan is quite regular in three 
dimensions. See Figure 1(a). 


Figure 1(a) - Three dimensional 
view of an SW-banyan 
F=S=3, L=2 
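
The defining property of a banyan, exactly one path from any base to any apex, can be checked mechanically. The Python sketch below builds a regular L-level banyan with fan-out F and spread S using a digit-based node labelling. That labelling is an assumption made for illustration; it yields a graph with the stated fan-out, spread and unique-path property, but it is not claimed to reproduce the exact wiring drawn in Figure 1.

```python
from itertools import product


def build_banyan(levels, fanout, spread):
    """Return {node: successors} for a regular banyan (assumed labelling).

    A node at level j is (j, s, f): s holds j digits in range(spread) and
    f holds levels - j digits in range(fanout). Level 0 holds the bases and
    level `levels` holds the apexes; arcs point from base toward apex.
    """
    arcs = {}
    for j in range(levels + 1):
        for s in product(range(spread), repeat=j):
            for f in product(range(fanout), repeat=levels - j):
                if j == levels:
                    arcs[(j, s, f)] = []     # an apex has no outgoing arcs
                else:
                    # Drop one fan-out digit and append any spread digit.
                    arcs[(j, s, f)] = [(j + 1, s + (d,), f[1:])
                                       for d in range(spread)]
    return arcs


def count_paths(arcs, node, apex):
    """Number of directed paths from node to apex."""
    if node == apex:
        return 1
    return sum(count_paths(arcs, nxt, apex) for nxt in arcs[node])


arcs = build_banyan(levels=3, fanout=2, spread=2)
bases = [n for n in arcs if n[0] == 0]
apexes = [n for n in arcs if n[0] == 3]
unique = all(count_paths(arcs, b, a) == 1 for b in bases for a in apexes)
print(len(bases), len(apexes), unique)
```

With this labelling there are F^L bases and S^L apexes, every non-apex node has S outgoing arcs, and every non-base node has F incoming arcs.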


Interconnection of devices by means of a 
banyan generally follows this strategy. Devices 
to be explicitly connected are attached to base 
nodes only. (In this paper, some devices are 
attached to apex nodes, but these are implicitly 
connected by the strategy given below.) The set 
of the devices is partitioned, and a bus-like 
interconnection for each block of the partition 
is to be set up in the switch. This bus-like 
interconnection is actually graphically a tree of 
bidirectional amplifiers such that any device on 
a leaf of the tree can broadcast data to all 
the leaves of the tree "instantaneously". 


Figure 1(b) - An SW-banyan L=3, S=F=2 

Several trees may be set up in an SW-banyan, one for 
each block, to provide independent bus-like 
interconnections for them. Setting up a tree having 
n leaves can be done in one step using four 
separate control lines; however, it is easier to 
understand using n+2 steps and two control lines. 
Assume that other trees already utilize some arcs 
and nodes, and some arcs or nodes are known to be 
faulty. In step i, for i = 1, 2, ..., n, 
the ith selected leaf node broadcasts a signal 
towards the apex, which is blocked if it 
encounters a used or faulty node or arc. Each 
node records the resulting signal it gets. An 
apex node is a potential tree root if it gets a 
signal from each selected leaf of the tree. In 
step n+1, exactly one potential root is selected 
by a priority circuit. In step n+2 the selected 
root broadcasts downwards as selected leaves 
broadcast upwards; any arc getting a signal from 
the selected root and any one of the selected 
leaves becomes connected to form the tree. This 
algorithm is converted by DeMorgan's law into a 
one-step algorithm. 
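
The n+2 step procedure can be sketched in software as follows. This is a minimal Python illustration under several assumptions: the banyan uses a digit-based node labelling chosen for convenience, only nodes (not individual arcs) can be marked used or faulty, and the priority circuit is modelled by taking the lowest-labelled candidate root. The function names are hypothetical.

```python
from itertools import product


def build_banyan(levels, fanout, spread):
    """Regular banyan as {node: successors}; the labelling is an assumption."""
    arcs = {}
    for j in range(levels + 1):
        for s in product(range(spread), repeat=j):
            for f in product(range(fanout), repeat=levels - j):
                arcs[(j, s, f)] = [] if j == levels else [
                    (j + 1, s + (d,), f[1:]) for d in range(spread)]
    return arcs


def set_up_tree(arcs, leaves, blocked=frozenset()):
    """Return (root, tree arcs), or None if the request is blocked."""
    top = max(node[0] for node in arcs)
    # Steps 1..n: each selected leaf broadcasts toward the apexes and every
    # free node records which leaves it heard.
    heard = {}
    for leaf in leaves:
        frontier = [leaf]
        while frontier:
            node = frontier.pop()
            if node in blocked:
                continue                      # used or faulty: signal stops
            heard.setdefault(node, set()).add(leaf)
            frontier.extend(arcs[node])
    # Step n+1: pick one apex that heard every selected leaf.
    roots = sorted(n for n, who in heard.items()
                   if n[0] == top and who == set(leaves))
    if not roots:
        return None
    root = roots[0]                           # assumed priority rule
    # Step n+2: the root broadcasts downward; an arc joins the tree when it
    # carries both the root's signal and some selected leaf's signal.
    preds = {}
    for u, ups in arcs.items():
        for v in ups:
            preds.setdefault(v, []).append(u)
    tree, seen, frontier = set(), {root}, [root]
    while frontier:
        v = frontier.pop()
        for u in preds.get(v, []):
            if u in blocked:
                continue
            if u in heard:
                tree.add((u, v))
            if u not in seen:
                seen.add(u)
                frontier.append(u)
    return root, tree


arcs = build_banyan(levels=3, fanout=2, spread=2)
a, b = (0, (), (0, 0, 0)), (0, (), (1, 1, 1))
root, tree = set_up_tree(arcs, [a, b])
print(root, len(tree))
```

Marking the first-choice root as used makes the procedure fall back to another apex, and marking every node directly above a leaf makes the request fail, which is the blocking behaviour described earlier.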


We choose the SW-banyan because it is 
capable of implementing all three types of 
interconnection strategies -- structuring, 
configuring and communication -- as we will soon 
show. The shared bus and shared memory techniques 
characterized by Anderson and Jensen [1] 
would certainly experience the contention 
problems and excessive delays which would limit 
the size of the system. Store and forward 
switching like the Pierce loop [14] and the "perfect 
shuffle" [18] are useful for packet switching 
but not for bus-like operations, so they would 
not be suitable for structuring and configuring 
by means of the fetch-execute cycle operations in 
a varistructure processor. The only flexible 
interconnection structure known before the banyan 
was the crosspoint switch, which is too expensive 
for large systems. A banyan is as flexible as a 
crosspoint switch and is easier to control in 
hardware. It makes the best interconnection 
structure for structuring as it can make bus-like 
interconnections and carry-look-ahead links whose 
delay is proportional to the log of the number of 
processors. Two banyan structures, the SW and 
the CC banyans, are known [16]. The SW-banyan 
switch is chosen for the interconnection network 
because it is easy to expand such a switch into 
a larger banyan switch without rewiring the inner 
structure. 


In this varistructure architecture, n-bit- 
sliced microprocessors are connected to the 
apexes, and the memory modules or I/O devices 
are connected to the bases of the banyan network 
as shown in Figure 4(a). Each link of the banyan 
network contains, for instance, 8 data lines, 
16 address lines, 4 control lines for search and 
set-up and three lines for carry-look-ahead 
functions. Each node has a carry-look-ahead 
logic circuit fabricated from standard TTL gates 
as shown in Figure 2(a). The data lines and 
address lines in a link will be used to connect a 
tree, which will be used like a data bus or 
address bus between processor and memory. They 
can be fabricated from standard integrated 
circuits like DM7833 quad transceivers [21] or 
CD4066 [22] analog switches. However, a specially 
designed digital bidirectional amplifier, shown 
in Figure 2(b) [19,25], would be preferred for this 
purpose for two reasons. It has higher noise 
immunity than the CD4066 analog switch, and it 
does not require a control signal to direct data 
through the tree, unlike the DM7833, which 
requires additional logic to enable the tristate 
gate in the direction of data flow through the 
switch. 
Figure 2(a) - Carry-Look-Ahead tree at a node 
Figure 2(b) - A Bidirectional Amplifier 


2.2 Processor, Memory and I/O Modules 


We specify the structure of the processors, 
memory and I/O modules to show what the SW- 
banyan will connect together. The processor is, 
by design, a rather conventional microprocessor. 
Special memory is required so that a physical 
page of memory can be assigned different page 
numbers, in order that a collection of memory 
modules can be freely assigned to store pages of 
data required by a task. The logic is distri- 
buted, so that each physical page has just the 
required logic to associate different page 
numbers to it. This technique is essential to 
both configuring and structuring memory 
efficiently. As a bonus, the logic is so similar 
to that required for virtual memory that we find 
it quite simple to incorporate it. Input/output 
connections are provided to page data into and 
out of memory modules at high data rates, and a 
secondary memory is implemented to store pages 
efficiently. Since a directory creates 
attendant problems in a reconfigurable, vari- 
structure machine and since the memory utiliza- 
tion may well be sparse (i.e. a task may use 
pages 0 to 5, 20 to 27 and 47) the secondary 
memory should be intelligent enough to store 
these efficiently without any external directory. 
We discuss the processor first, then we describe 
the memory and secondary memory modules in more 
detail. 


All the processors are connected at the 
apexes of the network. These processors are 
conventional byte-slice microprocessors as shown 
in Figure 3(a). Each has a ROM for storing the 
microinstructions, an ALU, a program counter and 
an instruction register. As in most microprocessors, 
it sends 16 bit addresses to the memory 
modules and I/O modules, and sends or receives 8 bit 
data bytes to or from them. However, it also 
provides external accesses to its carry-look- 
ahead logic (generate, propagate and carry) and 
a memory cycle state signal (indicating instruction 
fetch, data recall or data memorize). All 
the processors in the system are identical and 
should fit in an LSI chip. 
Figure 3(a) - Processor Module 


The memory module consists of a RAM, for 
example 1K x 8 bits, and support logic to implement 
a virtual memory system as described in [2]. 
The module contains a six bit page-number 
register. The higher order six bits of the 
address are compared with this page register. 
If the match is successful in a memory module 
then the lower order ten bits are used to address 
the RAM in that module. Memory modules are 
attached to the base nodes of the banyan. A 
number of memory modules will be connected to a 
processor by means of a tree formed within the 
banyan, and each memory module will have a 
different value in its page register. When a 
processor sends an address to all the memory 
modules through the tree, only one of the memory 
modules connected to this processor has a page- 
number register matching the higher order 
six bits of the address. See Figure 3(b). 
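
The page-matching scheme can be modelled in a few lines of Python. This is a behavioural sketch only: the class and function names are hypothetical, the data tree is represented simply as a list of the modules attached to it, and a miss is reported as an exception rather than handled by the swapping hardware.

```python
PAGE_BITS = 6       # high-order bits compared with the page register
OFFSET_BITS = 10    # low-order bits that address the 1K x 8 RAM


class MemoryModule:
    def __init__(self, page_number):
        self.page_register = page_number
        self.ram = bytearray(1 << OFFSET_BITS)

    def matches(self, address):
        """Compare the high-order six bits with the page register."""
        return (address >> OFFSET_BITS) == self.page_register


def read(data_tree, address):
    """Broadcast a 16 bit address; the single matching module answers."""
    offset = address & ((1 << OFFSET_BITS) - 1)
    hits = [module for module in data_tree if module.matches(address)]
    if len(hits) != 1:
        raise LookupError("no unique module holds page %d"
                          % (address >> OFFSET_BITS))
    return hits[0].ram[offset]


# Sparse pages 0, 3 and 47 held by three modules on one data tree.
tree = [MemoryModule(0), MemoryModule(3), MemoryModule(47)]
tree[1].ram[7] = 0x5A
print(hex(read(tree, (3 << OFFSET_BITS) | 7)))
```

Because the match is made inside each module, the modules on a tree can hold any set of page numbers, contiguous or not, without a central directory.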


This memory module can easily be extended to 
support virtual memory. If none of the modules 
match the page number then one of the pages has 
to be swapped out. For this purpose support 
hardware (such as age-use-counter and dirty bit) 
can be provided on the virtual memory (VM) module. 


Execution of a task requires a set of 
secondary memory modules. For back up storage 
a self managing secondary memory, SMSM [11], 
fabricated from charge-coupled devices or magnetic 
bubble memories is attached to an I/O port. 
This memory is capable of storing variable length 
records along with a label, and the record can 
be accessed using this label. More significantly, 
however, the hardware associated with 
SMSM can search for the record by its label and 
can delete, input or output data from that record. 
Sparse pages can be efficiently stored using the 
label as a page number. However, data described 
by capabilities and data in stacks can also be 
stored on SMSM without maintaining a directory 
for the devices. Some other I/O ports are 
connected to peripheral devices, such as teletypes, 
to communicate with the external world, 
or to the controller of the banyan network. 
Figure 3(b) - Memory Module 


2.3 Instruction and Data Trees 


A task execution will generally require 
some memory modules and I/O devices to be 
connected to some available processors. A tree 
structure, called the data tree, is created in the 
SW-banyan network with these memory modules and 
I/O ports as the leaves and the processor as the 
root. This is shown in Figure 4(a). This tree 
is set up using the four control lines in the 
manner described in section 2.1 for search and 
set-up. The processors are indistinguishable; 
therefore we first choose the memory and I/O 
ports to be connected and then connect them to 
any available processor. This way the probability 
of encountering blocked paths is reduced. 


Figure 4(a) - Data trees in the SW-banyan 


A programmer may decide to use more than 
one byte of precision (say p bytes) and possibly 
he may decide to operate on a vector of n elements 
in SIMD mode. Execution of such a task would 
require p x n processors to be connected to their 
respective memory modules using separate data 
trees. Each separate data tree is set up 
as in the previous paragraph. In the same 
banyan one inverted tree is now formed, as 
pointed out by Goke [17], which connects 
the processors which were the roots first 
chosen for the data trees. In this inverted 
tree, the processors are leaves 
and an unused memory or I/O module is the root. 
This tree will serve for instruction transmission 
and linking the carries as discussed in section 
3. The carries are broken by the processors, by 
forcing propagate and generate to zero in the 
carry-look-ahead circuits, after every p processors. 
The carry circuit can be connected so as 
to either propagate to the left or to the right 
for left shift/right shift operations or 
propagate carries for addition. Thus n processing 
units are formed, each of which operates on 
a word of length p bytes, and all of them 
execute the same instruction as transmitted on 
the instruction tree. The instruction tree is 
shown in Figure 4(b) and the unfolded instruction 
and data trees are shown in Figure 4(c). In the 
dynamic structure, i.e. during every fetch- 
execute cycle, connection of the instruction 
trees will be made only during the fetch cycle 
to get the instruction to each processor. 


2.4 Memory Sharing Trees 


The data and instruction trees discussed in 
section 2.3 are used essentially as busses to 
connect memory and I/O to processor and to 
connect processors together. That is, the 
amplifiers in the links of these trees are essentially 
permanently on. A different tree is now 
introduced to share memory. Oriented like an 
instruction tree, the leaves of the memory-sharing 
tree are processors that are selected to 
share a page of memory and its root is a memory 
module. An active chain, from the memory module 
at the root to a processor at one of the leaves, 
is established from time to time. The amplifier 
links in the active chain are turned on, and the 
active chain is effectively appended to the data 
tree of the processor at the end of the chain. 
The other links of the memory-sharing tree are 
not used for data paths, but are available to 
form active chains when another is to be established. 
Note that the data trees, even while 
augmented by the active chains, are mutually 
exclusive. The fetch-execute cycle will be 
defined using these mutually exclusive data trees. 
The machinery to establish the active chains, an 
arbiter, is identical to the carry-look-ahead 
logic used in instruction trees. When an 
instruction tree is created the machinery acts 
as a carry-look-ahead generator, and when a 
memory-sharing tree is formed it acts as an 
arbiter. 


Figure 4(b) - Instruction trees in the SW-banyan 


3. Dynamic Aspects of the Architecture 


Having presented the basic elements of the 
architecture, we now discuss how they work. 
Specifically, the scheduler's actions, fetch- 
execute cycle, and memory-sharing mechanisms are 
considered. 


3.1 Scheduler Mechanisms 


The user specifies the desired precision 
and vector size by means of dimension declarations, 
or they are determined by default. The 
scheduler will try to set up the process when 
the required resources are available. Data trees 
are created one at a time; then an instruction 
tree is created from the processors selected by 
the data trees as discussed in section 2.1. 
Failure to connect the data trees or the instruction 
trees would abort the process. A single 
carry linkage is provided in the instruction 
tree. Some of the processors are flagged to 
always break the carry linkage by setting 
generate and propagate to zero, and the page 
numbers are inserted into the virtual memory 
modules so that each data tree has the necessary 
pages for each byte-slice of data. Program 
memory is inserted anywhere in the data trees so 
that each page of the program appears just once 
in a data tree that is connected to the 
instruction tree. Finally the memory is loaded 
from the secondary memory and the process is 
started. 


Figure 4(c) - Unfolded data trees 
and Instruction trees 


Figure 5(a) - Programmer's View 


This is elaborated further with the help of 
an example. Here the user requires two pages of 
double precision numbers in a three element 
vector. The programmer's view of the machine is 
given in Figure 5(a). Execution of this task 
requires six processors and fifteen memory 
modules, three of which are used for storing 
programs. In the machine these modules are 
connected as shown in Figure 5(b). This example 
will be continued in the next section. 
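
The resource counts in this example can be recovered with simple arithmetic. The Python sketch below is one reading of the example, assuming that double precision means two byte slices per word, that every byte slice needs one memory module per data page, and that the three program modules are counted in addition; the function name is illustrative.

```python
def resources(precision_bytes, vector_length, data_pages, program_pages):
    """Return (processors, memory modules) needed by one task."""
    # One byte-slice processor per byte of every vector element.
    processors = precision_bytes * vector_length
    # Each processor's data tree holds one module per page of data.
    data_modules = processors * data_pages
    return processors, data_modules + program_pages


print(resources(precision_bytes=2, vector_length=3,
                data_pages=2, program_pages=3))   # (6, 15)
```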
Figure 5(b) - Configuration of switch 


3.2 Fetch-Execute Cycle 


The fetch-execute cycle operates in terms 
of data and the instruction trees set up by the 
scheduler. Consider execution of instruction 
ADD 7 which is stored in page number 3. The 
programmer views this machine in terms of 
Figure 5(a) where the instruction from page 3 is 
sent to all the processors. We now show how the 
instruction is fetched from page 3 to all the 
processors and how data in each processor is 
independently and concurrently recalled. 


In the machine, at the beginning of the 
fetch cycle, all the processors present the same 
address to their memory modules, as the program 
counters in all the processors have identical 
values. Only one memory module matches the 
address and reads the instruction from its 
memory; the instruction is sent to all the 
processors through the data and instruction trees. 
The processors load the word into their instruction 
registers, then decode and execute the instruction. 


In this example the instruction requires 
word 7, which is on page 0, to be recalled. The 
programmer views this in terms of Figure 5(a), where 
the multiprecision vector word in page 0 is 
recalled to the corresponding processors. The 
address 7 is simultaneously generated by each of 
the processors and transmitted on their respective 
data trees. One of the virtual memory 
modules in each data tree, with page number 0, 
matches this address and sends back the word on 
the data tree. These six bytes of data are 
recalled into six processors via the six disjoint 
data trees assigned to this task. The instruction 
tree is disconnected for this cycle. 


The processors appear to add the word 
recalled from memory into their accumulators, as 
suggested by Figure 5(a). The instruction is 
executed and the carries propagate through the 
carry-look-ahead circuit created in the instruction 
tree. At the word boundaries, identified 
by the flags set, the carries are inhibited by 
forcing propagate and generate to zero, e.g. in 
the processors second, fourth and sixth from the 
right. 
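
The effect of breaking the carry at word boundaries can be illustrated with a small Python model of the six byte slices. It is a sketch under stated assumptions: the slices are listed from right to left, the carry is rippled serially in software rather than computed by a look-ahead tree, and the function name and flag list are hypothetical.

```python
def sliced_add(acc, operand, break_carry):
    """Add byte slices listed right to left.

    break_carry[i] is True for a slice that ends a word, so its carry
    must not reach the next slice.
    """
    result, carry_in = [], 0
    for a, b, is_boundary in zip(acc, operand, break_carry):
        generate = a + b > 0xFF       # slice produces a carry by itself
        propagate = a + b == 0xFF     # slice passes an incoming carry on
        result.append((a + b + carry_in) & 0xFF)
        carry_out = int(generate or (propagate and carry_in))
        # A flagged slice forces generate and propagate to zero, so no
        # carry crosses into the neighbouring word.
        carry_in = 0 if is_boundary else carry_out
    return result


# Three 2-byte words (p = 2, n = 3): 0x01FF, 0xFFFF and 0x0010, plus 1 each.
acc = [0xFF, 0x01, 0xFF, 0xFF, 0x10, 0x00]
one = [0x01, 0x00, 0x01, 0x00, 0x01, 0x00]
flags = [False, True, False, True, False, True]   # 2nd, 4th, 6th from right
print([hex(x) for x in sliced_add(acc, one, flags)])
```

The second word wraps from 0xFFFF to 0x0000 without disturbing the third word, which is what the break after every p processors is meant to guarantee.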


3.3 Shared Memory 


When several processors request the shared- 
memory module in the same cycle, the carry- 
look-ahead circuit is used as an arbiter to grant 
only one of the processors access to the shared 
memory. The priorities can be assigned in a 
fixed way or can be rotated in a round-robin 
fashion, as we now show. 


All the processors except one have propagate 
equal to 1 and any processor requesting shared 
memory makes generate equal to 1. The carry 
output is connected to the carry input to effect 
"end-around carry". 


By moving this propagate equal to zero 
position dynamically among the processors sharing 
the memory module, round-robin priority disci- 
pline is implemented. If at any processor 
carry-in = 1 is detected then it means that 
another processor which is to its right is 
requesting the shared memory; therefore its 
request is inhibited. The processor requesting 
the shared memory and having carry-in = 0 is 
granted access to the shared memory. This processor 
sends an address to the shared memory and if 
the page number matches then the active chain is 
established between the processor and memory. 

If a page is unavailable then the processor 
would wait for a few cycles and try again. If a 
processor's request is just inhibited because of 
carry-in = 1 then it will try again in the next 
cycle. Once access is granted, by means of the 
active chain, the shared memory appears to be in 
the data tree of the processor that was granted 
access. It can address this shared memory in a 
fetch-execute cycle as defined in section 3.2. 
When the processor no longer needs to access the 
shared memory, it releases it so that another 
processor requesting it can be granted access. 
It is expected that once a processor is granted 
access to a shared memory page, it will use it 
for several tens of cycles before releasing it. 


Shared memory provides a mechanism for indi- 
visible operations like test and set, because a 
processor is expected to have complete control of 
a shared memory module for several cycles. We 
propose that control communication use only this 
mechanism. Processors expecting a control signal 
will continually check a shared memory module. 
Unexpected control information might be communi- 
cated by writing all software so that it perio- 
dically checks a shared memory module. The round- 
robin arbiter mechanism for accessing shared 
memory should give each processor its opportunity 
to access its control signals. However, some 
form of interrupt may be necessary, to alert a 
processor to look at its control signals. This 
problem is now being studied. 


4. Preliminary Remarks on Software


Although possibilities abound for using com- 
plex, powerful software, some very simple soft- 
ware techniques should be easily implemented to 
take considerable advantage of this machine. 

One technique might be to schedule independent
four byte wide tasks for all users. Space sharing
this machine in this simple way is superior to
time sharing a conventional large machine, since
the scheduler need not find consecutively num-
bered pages to store a program (the switch can
allocate any collection of memory modules to a
data tree) and since the operating system need not
bother to keep processors busy through swapping
programs in and out (the individual partitions
can be allowed to be idle since they form only a
small fraction of the system resources). More-
over, better utilization of memory and of I/O
resources should be feasible, since resources can
be carefully assigned to the processors that need
them. Finally, fail-soft operation and memory
protection can be easily obtained.


Next, it should be easy to write compilers 
where the instruction set is fixed but the object 
processor word width is selectable at compile 
time. If a program is heavily character string 
oriented, the entire program can be compiled for 
a one byte wide processor. A program that uses 
a lot of 64 bit numbers could be compiled for an 
eight byte wide processor. Although multi- 
precision do-loops are not completely eliminated 
(the program compiled for a one byte wide pro- 
cessor may have to operate on occasional 3 byte 
numbers) their frequency may be reduced. Stan-
dard programming languages can be used. New
languages that have vector capabilities (like APL)
can also be written to use the SIMD features of
this machine. Herein, the scheduler gets rela-
tively fixed requests for processors and other
resources, which it can effectively handle.
Multiple precision Do-loops and vector Do-loops
should be substantially reduced, lowering the
overhead in processing large numeric programs in
particular.


It should be possible to set up pipelines of 
otherwise independent partitions of the resources 
to increase throughput. For instance, a compiler 
can be partitioned into a lexical analyser, a
parser, etc., and these can be put in different
partitions of the machine. Memory sharing can 
be used to pass data and control between these 
partitions. More generally, some simple forms of 
data flow programming at the "subroutine level" 
might be easy to implement. This simple utili- 
zation of multiprocessing should be quite effec- 
tive in this flexible machine. 


The techniques above seem to be within the 
reach of current software technology. More 
exciting possibilities exist in this machine, 
however. As microprogrammers control multiple 
registers and busses in parallel through horizon- 
tal microprogramming techniques, they may control 
complete computing partitions of this machine, 
some of which can be vector (SIMD) processors, 
in tightly coupled programs. Partitions might be 
joined together and broken apart to execute sub-
tasks of a program. For instance, Mori et al
[12,24] are considering implementing floating
point operations in separate partitions of their
machine, one for the exponent and one for the
fraction. This may work well in this machine.
We are studying the use of a separate partition to
analyse descriptors for a partition operating as
a vector machine [23]. One advantage of this
concept is that the width of the vector machine
can be assigned at run time, and that width can
be different from what the user requested, yet
the resources will be efficiently used. High
level languages could be interpreted by similar
partitions.


This new kind of programming appears very 
exciting. By analogy, a composer writes a theme, 
an orchestrator divides the music for different 
instruments, and the conductor coordinates the 
music. We believe that exciting challenges lie
ahead for the orchestration of partitions - the
translation of algorithms already designed for 
conventional machines into parallel forms to be 
executed in different partitions, and the con- 
ducting of partitions - the operating system and 
hardware that keep the partitions synchronized. 
In addition to loosely coupled multitasking using 
fork and join, or their equivalent, horizontal 
microprogramming techniques will be used to 
tightly couple the partitions. Orchestration may 
become a new discipline in programming technology. 


It is unlikely that all programmers will be 
aware of orchestration. Rather, carefully orches- 
trated subroutines will be available to the 
average programmer. The programmer will simply 
write programs to call up these subroutines. The 
machine will be restructured and reconfigured as 
directed by the subroutines at run time. This 
kind of complex, powerful software may be able to 
take full advantage of the machine. 


It is not clear which level of advancement 
can justify the cost of this machine. Since simple 
identical processors and memory modules are used, 
this machine offers the possibility at the outset 
of being more cost-effective than large contem- 
porary machines because it can take advantage of 
LSI technology. However, the switch, though
inexpensive in comparison to other switches, is
still costly. It is our contention that by sim- 
plifying scheduling, reducing software overhead, 
providing pipelining and data flow control, we 
can obtain significant advantages over current 
designs. Moreover, it is our hope that orches- 
tration will provide unprecedented power to this 
machine.


5. Conclusions

A computer for scientific applications 
should parallel inexpensive microprocessors to 
cost-effectively achieve the throughput necessary 
to solve massive problems such as weather predic- 
tion, yet a number of small problems should be 
able to cost effectively space-share the assem-
blage of microcomputers. Reconfigurability,
variable structure and memory sharing are emi-
nently suited to this task. Although a switch,
even a banyan with cost n ln n, is a rather expen-
sive item and should not be used indiscriminately,
we have shown that the same banyan can provide
reconfigurability, variable structure and memory
sharing, with little effort. The user has
unprecedented freedom in specifying the apparent
width and height of this memory, and can effec- 
tively manipulate one byte wide character strings 
or n element vectors whose elements have preci- 
sion p. Moreover, the scheduler has the capa- 
bility to interconnect any available processors, 
memories and I/O devices without restrictions, 
such as using contiguous cells in a chain, and 
very high input-output bandwidth is attainable 
by connecting an I/O device to each data tree. 
We confidently brandish our claim that this 
architecture is the best yet described for 
scientific applications.


Abstract -- The design of a special purpose com-
puter for the numerical solution of problems in
fluid mechanics was discussed in meetings at RAND
Corporation in 1976-77 [9]. The number of compu-
tations required puts many important problems
beyond the reach of the most advanced computers
available today. Speedups attainable by techno-
logical advances and software optimization do not
help enough; large scale parallelism was deemed
necessary. Finite difference methods break the
continuous space of physical problems into dis-
crete compartments iterated over a region. Equa-
tions relate physical quantities in a small
neighborhood of compartments. Since these equa-
tions are identical over large regions, a design
was proposed which consisted of an array of identi-
cal microprocessor chips communicating with near-
est neighbors. Each chip carries out the computa-
tions expressed in the finite difference equations
corresponding to a single compartment and all
chips are simultaneously active. This design can
be viewed as a digital simulation of the
physical system. Reasonable numerical parameters
suggest arrays of 100^2 chips, possibly in layers
or internally organized to model three dimensional
physical problems. The differences between this
design and other parallel machines are described.
Novel organization is made economically feasible
by recent advances in Large Scale Integration (LSI)
technology.


A Cellular Array Computer for Fluid Dynamics 


There are many practical problems which
require the accurate and rapid calculation of
fluid-dynamic forces and flow phenomena. These
include aircraft and ship design, weather fore-
casting, and biological modelling. Progress in
providing advances in the quantitative analysis of
these problems is paced to a great extent by the
computing power available in the simulation of the
flow phenomena of interest. The major computation
in all cases involves solution of the Navier-Stokes
equations. Hence the solution of a very general
class of problems would be greatly facilitated by
the development of a special purpose Navier-Stokes
computer.

Increasing memory size permits the use of a
finer spatial resolution for a fixed volume of
fluid and/or the calculation of flows in larger
volumes. The number of operations per time step
increases at a slightly faster rate than the num-
ber of mesh points (or modes, for spectral
methods). In fact, it is easily shown that if $N^3$
mesh points are used in a calculation the number
of operations per time step is proportional to
$N^3 \log_2 N$ for the best methods.

While there have been some improvements in
algorithms, progress has been tied to the develop-
ment of ever faster general-purpose computers.
The most recent "super-computers" incorporate
substantial amounts of pipelining and parallelism
in their CPUs in addition to using faster process-
ing elements. In short, high-speed computation
has been sought via faster and more complex CPUs.
It appears that all of today's "super-computers"
were designed to handle many different classes of
scientific computing problems. The cost of this
versatility appears to be that these architectures
are far from optimum for any class of problems.
In fact there are many important problems for
which the fastest existing computers are hopelessly
slow.

Recently, Dr. I. E. Sutherland [9] has sug-
gested an architecture which appears to be much
closer to optimum, for computational fluid dyna-
mics, than that of existing "super-computers."
This architecture is a two-dimensional array of
$L \cdot M$ cells (see Figure 1).


Figure 1: Schematic of the cell computer (detail of
one cell with its registers)


Each cell can communicate directly with its near-
est neighbors "above" and "below" and to the
"left" and "right". A single cell is assumed to
contain some memory, an adder and some registers.
The data is stored in memory. The registers are
used as working memory to store intermediate
results and the adder is used to perform binary
addition and multiplication by shift-and-add.
It is clear that local communication is relatively
cheap and long range communication may be prohi-
bitively expensive because of cell-to-cell propa-
gation.

The potential advantages of this architecture
are threefold: first, it appears possible to
build such a computer using existing technology
(each cell is only a few microprocessor chips,
perhaps a single chip) at fairly modest cost,
particularly if the intercellular connections are
minimized; second, technological developments in
the semiconductor industry, i.e., advances in
Large Scale Integration (LSI) fabrication, are
leading toward increasing complexity and density
per chip and lower cost per chip; and third, if
the array of cells can be mapped onto the fluid
domain such that $N^2$ operations can be performed
in parallel on $N^2$ chips, the number of sequential
operations per time step can be reduced from
$O(N^3 \log_2 N)$ to $O(N \log_2 N)$.

This architecture appears so promising that
it is worthwhile to examine a test case. The test
case considers the incompressible flow of a fluid
in a boundary layer adjacent to a rigid, imperme-
able, no-slip wall.

The objectives of this exercise are to deter- 
mine: how well a standard algorithm fits a cell 
computer; what modifications to the algorithm are 
required; which part, if any, of the algorithm 
dominates the calculation; which operation domi- 
nates the calculation; how many operations are 
required per time step; what is the memory 
requirement per cell; and what is the computation 
time per time step, using conservative estimates 
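of the transfer, multiply, and add times. The
computational region is $0 < x < x_0$,
$0 < y < \infty$, $0 < z < z_0$. The equations of
motion are, of course, the Navier-Stokes equations
for an incompressible fluid

$$\frac{\partial \mathbf{u}}{\partial t} = -\mathbf{u} \cdot \nabla \mathbf{u} - \nabla p + \frac{1}{R} \nabla^2 \mathbf{u} \tag{1}$$

$$\nabla \cdot \mathbf{u} = 0. \tag{2}$$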


Here $R = U\delta/\nu$ is the Reynolds number, where $U$
is a characteristic free stream speed, $\delta$ is the
boundary layer thickness, and $\nu$ is the kinematic
viscosity. The velocity, $\mathbf{u} = \mathbf{i}u + \mathbf{j}v + \mathbf{k}w$, has
components $(u,v,w)$ in the x, y, and z directions
($\mathbf{i}$, $\mathbf{j}$, and $\mathbf{k}$ are unit vectors), and $p$ is the
pressure per unit density. A Poisson equation
for the pressure can be obtained by taking the
divergence of Eq. (1).

Appropriate boundary conditions for the
velocity are also specified, such as: the inflow
velocity field on the plane x = 0 is given; the
velocity $\mathbf{u} = 0$ on the plane y = 0, etc. Neumann
boundary conditions for the pressure field can be
obtained by evaluating equation (1) on the bounda-
ries. In order to represent the infinite physical
domain $0 < y < \infty$ in the finite computational
domain, a mapping is used to transform the infinite
domain ($0 < y < \infty$) onto the finite domain
($0 < \xi < 1$). It has been shown [10] that this
mapping yields highly accurate results with rela-
tively few grid points in those cases, as in this
problem, where the flow field at infinity is a
simple laminar flow. The only cost is that the
metric coefficients must be stored in each cell.

The physical space, $0 < x < x_0$, $0 < \xi < 1$,
$0 < z < z_0$, is divided into $L \cdot M \cdot N$ cells cen-
tered on the points $(i\Delta x, j\Delta\xi, k\Delta z)$ where
$i = 1, 2, \ldots, L$; $j = 1, 2, \ldots, M$; and
$k = 1, 2, \ldots, N$.
The x component of the velocity is defined at the
center of the front and back faces of the cell,
i.e., $u_{i-1/2,j,k} = u((i - \tfrac{1}{2})\Delta x, j\Delta\xi, k\Delta z, t)$.
Similarly the $\xi$ and z components of the velocity,
v and w, are defined at the centers of the top and
bottom and side faces of the cell. The pressure is
defined at point (i,j,k) in the center of the
cell.

A column is defined as all those fluid cells
at constant x; a row as all those fluid cells at
constant $\xi$; and a rod as all those fluid cells at
constant z. It will be assumed that one rod of
data, the (i,j) rod, is stored in cell (i,j).
This of course implies that cell (i,j) has some
multiple of N words of memory plus sufficient
memory for constants such as 1/R, the metric
coefficients, etc.

The spatial differencing scheme is centered,
second order. It can be shown that this approxi-
mation, when applied to Eqs. (1) and (2) with
appropriate boundary conditions, is conservative
in the sense that mass is conserved for any $R > 0$
and, in the limit $R \to \infty$, momentum, energy, and
enstrophy are also conserved.

The time differencing is of the Adams-Bashforth
type. Let us define $F^n_{i-1/2,j,k}$ as the right-hand
side of Eq. (1) defined at (subscripts) the point
$(i\Delta x, j\Delta\xi, k\Delta z)$ in space and at (superscript)
time $n\Delta t$. $F^n_{i,j-1/2,k}$ and $F^n_{i,j,k-1/2}$ are defined in
a completely analogous way. Then the Adams-
Bashforth approximation to the time derivative is

$$\frac{u^{n+1}_{i-1/2,j,k} - u^{n}_{i-1/2,j,k}}{\Delta t} = \frac{3}{2} F^{n}_{i-1/2,j,k} - \frac{1}{2} F^{n-1}_{i-1/2,j,k} \tag{3}$$
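
As a rough illustration of this kind of time differencing, the sketch below applies the standard two-step (second-order) Adams-Bashforth rule to a scalar ordinary differential equation. It assumes that this is the variant intended here; the function name `adams_bashforth2`, the scalar test problem and the forward-Euler start-up step are illustrative choices, not taken from the paper.

```python
def adams_bashforth2(rhs, u0, dt, steps):
    """Integrate du/dt = rhs(u) over `steps` time steps of size dt.

    Uses u[n+1] = u[n] + dt * (3/2 * F[n] - 1/2 * F[n-1]),
    where F[n] = rhs(u[n]).
    """
    f_prev = rhs(u0)
    # The two-step rule needs F at two time levels, so one forward-Euler
    # step is taken first to supply the missing history (an assumption).
    u = u0 + dt * f_prev
    for _ in range(steps - 1):
        f_curr = rhs(u)
        u = u + dt * (1.5 * f_curr - 0.5 * f_prev)
        f_prev = f_curr
    return u


if __name__ == "__main__":
    # Model problem du/dt = -u with u(0) = 1; exact solution is exp(-t).
    approx = adams_bashforth2(lambda u: -u, 1.0, 0.001, 1000)
    print(approx)  # close to exp(-1)
```

In the cell computer each cell would keep the previous right-hand side in memory, so one step costs a single evaluation of F plus a few additions and multiplications.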


In order to advance $\mathbf{u}^n$ and $p^n$ by one time
step it is necessary to calculate $F^n_{i-1/2,j,k}$,
$F^n_{i,j-1/2,k}$, and $F^n_{i,j,k-1/2}$ and solve the Poisson equa-
tion for p. As one might expect, the most diffi-
cult part of the calculation is the solution of
the Poisson equation with Neumann boundary condi-
tions. This problem is elliptic, i.e., non-local.
A direct solution, say by using an FFT in one
direction and a tridiagonal equation solver in
the other, may be prohibitive because of the trans-
fer time cost. More importantly, these methods
are restricted to flows with very regular geome-
try; general geometries require the use of
relaxation methods.

We have assumed that the relaxation method
will be Red-Black SOR, which requires only K para-
llel iterations to converge if SOR requires K - 1
sequential iterations. This method has been
simulated using a model problem. The simulation
confirms the estimate.
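
For readers unfamiliar with the method, the following is a minimal sketch of red-black SOR on a small two-dimensional Laplace problem. It is not the authors' simulation: the function name `red_black_sor`, the Dirichlet boundary values (the paper's pressure problem uses Neumann conditions), the relaxation factor and the sweep count are all assumptions chosen to keep the example short. The point it shows is that all cells of one colour depend only on cells of the other colour, so each half-sweep could be carried out by all cells of that colour at once.

```python
def red_black_sor(u, omega=1.5, sweeps=200):
    """Relax the interior of grid u (list of lists) towards a solution
    of the five-point Laplace equation; boundary values stay fixed."""
    rows, cols = len(u), len(u[0])
    for _ in range(sweeps):
        for colour in (0, 1):  # 0 = "red" cells, 1 = "black" cells
            for i in range(1, rows - 1):
                for j in range(1, cols - 1):
                    if (i + j) % 2 == colour:
                        # Neighbours all have the opposite colour, so
                        # this update is independent of its own colour.
                        avg = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                      + u[i][j - 1] + u[i][j + 1])
                        u[i][j] += omega * (avg - u[i][j])
    return u


if __name__ == "__main__":
    n = 8
    # Boundary values taken from the harmonic function i + j; interior
    # starts at zero and should relax towards i + j as well.
    grid = [[float(i + j) if i in (0, n - 1) or j in (0, n - 1) else 0.0
             for j in range(n)] for i in range(n)]
    red_black_sor(grid)
    print(grid[3][4])
```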

It should be noted that a "standard" relaxa- 
tion method has been assumed. We have not yet 
examined other methods such as cell by cell 
divergence-pressure method [12] or the use of 
block or multigrid relaxation methods [5], which 
could increase the convergence rate. 

The operation count for one time step can be 
obtained by writing the finite difference equa- 
tions, finding where the required data are 
stored, counting the number of transfers needed 
to get these data into the registers in cell 
(i,j) and the number and type of arithmetic 
operations required. As an example, Figure 2
shows cell (i,j) and its neighbors. All the
data, and only the data, required to calculate
$F^n_{i-1/2,j,k}$, $F^n_{i,j-1/2,k}$, and $F^n_{i,j,k-1/2}$ are shown in
this figure in the cells where they are stored.
It can be seen that twenty words of data must be
transferred to cell (i,j); of these twenty only
two, $u_{i+1/2,j-1,k}$ and $v_{i-1,j+1/2,k}$, require a two
step transfer.

Figure 2: Data required to compute $F^n_{i-1/2,j,k}$,
$F^n_{i,j-1/2,k}$, and $F^n_{i,j,k-1/2}$


The number of in-cell transfers, cell-to-
cell transfers, additions, and multiplications
necessary to advance u and p by one time step are:


IN-CELL TRANSFERS         = 48 N
CELL-TO-CELL TRANSFERS    = (63 + 6fK) N
ADDITIONS                 = (208 + 8fK + 2 log_2 N) N
MULTIPLICATIONS           = (93 + 3fK + 4 log_2 N) N
Number of words of memory = 48 + 12 N
Number of registers       = 30


Here N is the number of mesh points in the
transverse flow direction (z direction) and the
product fK is the number of iterations for relaxa-
tion of the Poisson equation.

From an examination of this table, assuming
fK = 50 and N = 64, say, several facts become
apparent. If we take the in-cell transfer count
as unity, the cell-to-cell transfer count is
about 7, the addition count 14, and the multipli-
cation count is about 5. Because it is generally
true that the in-cell transfer time and the addi-
tion time are considerably smaller than the cell-
to-cell transfer time and the multiplication time,
it is clear that the total calculation time is
controlled by the cell-to-cell transfer and
multiplication counts and times. The ratio of the
multiplication count to cell-to-cell transfer
count is about one, so reducing either cell-to-
cell transfer time or multiplication time to zero
would result in a speedup of only a factor of two.


TABLE 1: TIME AND MEMORY ESTIMATES
(calculation times per time step, in seconds)

                          Interior  Boundary  Pressure  Velocity     Total   Memory per cell  Fast conventional
Problem                   points    points    field     field                (words/bits)     computer
#1 1/3 million point      0.060     0.027     0.076     [illegible]  0.087   432/~14K*        20.0
#2 1 million point        0.242     0.112     0.309     [illegible]  0.354   1584/~50K        81.9
#3 10 million point       0.242     0.112     0.309     [illegible]  0.354   1584/~50K        839

*1K = 1024

In order to gain some insight into the per-
formance capabilities of this type of array com-
puter, three sample problems will be considered:
(1) A (logical) array of 50 x 200 cells with
N = 32; 50 points in the direction normal to the
boundary, 200 in the downstream direction and 32
in the cross-stream direction. It is believed
that this "one-third of a million point" problem
is the absolutely minimum problem of interest in
fluid dynamics. (2) A (logical) array of 50 x
200 cells with N = 128; 50 points normal to the
boundary, 200 in the downstream direction and 128
across the stream. The "million point" problem
is quite interesting. (3) A (logical) array of
100 x 1000 cells with N = 128; 100 points normal
to the boundary, 1000 points in the downstream
direction and 128 points across the stream. This
"ten million point" problem is very interesting
because it allows study of important fluid flow
phenomena hitherto beyond our reach.

In order to calculate computation time per
time step it is necessary to make assumptions
about the in-cell transfer, cell-to-cell trans-
fer, addition, and multiplication times as well
as assume values for f and K. It will be assumed
that: In-cell transfer time is 100 nsec; Cell-
to-cell transfer time is 3 µsec; Addition time is
500 nsec; Multiplication time is 5 µsec; f is 1/2;
and K is 100. It will also be assumed that the
memory words are 32 bits and the registers are 64
bits long.
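
The estimate can be reproduced with a short calculation. The sketch below combines the operation counts given earlier with the unit times assumed above. Two points are assumptions rather than statements from the paper: the logarithm in the counts is taken as base 2, and the memory requirement is taken as 48 + 12 N words, the form that reproduces the 432 and 1584 words quoted in the text. The function names are illustrative.

```python
from math import log2

# Unit times assumed in the text, in seconds.
T_IN_CELL = 100e-9
T_CELL_TO_CELL = 3e-6
T_ADD = 500e-9
T_MULTIPLY = 5e-6


def operation_counts(n, fk):
    """Per-cell operation counts for one time step (log base 2 assumed)."""
    lg = log2(n)
    return {
        "in_cell": 48 * n,
        "cell_to_cell": (63 + 6 * fk) * n,
        "additions": (208 + 8 * fk + 2 * lg) * n,
        "multiplications": (93 + 3 * fk + 4 * lg) * n,
    }


def time_per_step(n, f=0.5, k=100):
    """Estimated seconds per time step on the cell computer."""
    c = operation_counts(n, f * k)
    return (c["in_cell"] * T_IN_CELL
            + c["cell_to_cell"] * T_CELL_TO_CELL
            + c["additions"] * T_ADD
            + c["multiplications"] * T_MULTIPLY)


def memory_words(n):
    """Words of memory per cell (48 + 12 N is an assumed reading)."""
    return 48 + 12 * n


if __name__ == "__main__":
    for n in (32, 128):
        print(n, round(time_per_step(n), 4), memory_words(n),
              memory_words(n) * 32 / 1024)
```

Under these assumptions the totals come out at roughly 0.087 sec for N = 32 and 0.353 sec for N = 128, close to the totals listed in Table 1, and the memory comes to about 13.5K and 49.5K bits per cell.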

In order to compare the performance of this
cell computer to a conventional computer, opera-
tion counts and time estimates for the conven-
tional computer are needed. The operation counts
for a three-dimensional finite-difference tech-
nique for a conventional computer are in [6]. In
this technique, the spatial differencing is the
same as used in the analysis of the cell computer
performance, second-order central differences;
the time differencing is Leap-Frog; and a Fast
Poisson Solver is used. The downstream direction
has L points, the cross-stream direction N points
and the direction normal to the boundary has M
points. It will be assumed that, for the conven-
tional computer: Addition time is 100 nsec;
Multiplication time is 200 nsec; and Memory trans-
fer time is 1 µsec.

The time estimates for one time step are
given in Table 1 for each of the three problems.
These estimates are broken down in several ways
(all times are in seconds). Column one lists the
problems. In the second column the time to
advance the interior points one time step is
given, while the third column is the time required
to advance the boundary points one time step. The
sum of these is the total time required to advance
all the grid points one time step, and this is
given in column six. This total time is also the
sum of the time required to compute the pressure
field, given in column four, and the time required
to compute the velocity field, column five. The
total number of bits of memory required for each
problem is listed in column seven. Finally,
column eight is the total time required to calcu-
late one time step on a conventional computer.

From an examination of Table 1 it is clear
that most of the computation time of the array
computer is spent computing the pressure field
for the interior points of the grid; the ratio
of the calculation time for the interior to the
calculation time for the boundary is about 2,
while the ratio of the calculation time for the
pressure field to that of the calculation time
for the velocity field is about 7. This holds
true for all three problems. In short, the
dominant part of the calculation is the relaxation
solution of the pressure field on the interior of
the grid.

Although it is not shown in this table, pre-
cisely the opposite holds true for the calcula-
tion on the fast conventional computer using a
fast Poisson solver, wherein the ratio of the
velocity field calculation time to the pressure
field calculation time is about 3.

For problems #1 and #2, the ratio of total
calculation time on the conventional computer to
that of the array computer is ~230. Increasing
the number of grid points in the cross-stream
direction from 32 to 128 changes this ratio by
less than 1 percent. However, increasing the num-
ber of cells from 10^4 to 10^5 (problem #3)
increases the ratio to about 2370. As was expec-
ted, the ten-fold increase in the number of cells
is fully reflected in the speed ratio. Note that
for problems #1 and #2 the speedup due to paral-
lelism reduces a week of computation time on a
sequential computer to under one hour. For prob-
lem #3, a week is reduced to about five minutes.
Thus calculations whose times and costs were pro-
hibitive on existing computers could be used
routinely as standard tools by researchers in
fluid dynamics and designers of vehicles such as
aircraft and ships.
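
The speed ratios quoted above follow directly from the totals in Table 1, as the short check below shows. The dictionary layout and function names are illustrative only; the numbers are the per-time-step totals from the table.

```python
WEEK_MINUTES = 7 * 24 * 60

# (conventional computer, cell computer) seconds per time step, Table 1.
TABLE_1 = {1: (20.0, 0.087), 2: (81.9, 0.354), 3: (839.0, 0.354)}


def speed_ratio(problem):
    """Conventional time divided by cell-computer time."""
    conventional, array = TABLE_1[problem]
    return conventional / array


def week_reduced_to_minutes(problem):
    """Minutes on the cell computer for one week of sequential work."""
    return WEEK_MINUTES / speed_ratio(problem)


if __name__ == "__main__":
    for problem in (1, 2, 3):
        print(problem, round(speed_ratio(problem)),
              round(week_reduced_to_minutes(problem), 1))
```

The ratios come out near 230 for problems #1 and #2 and near 2370 for problem #3, so a week of sequential computation shrinks to about 44 minutes and a little over 4 minutes respectively.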

The problem with 32 grid points in the cross-
stream, z, direction requires 432 words of memory,
about 14K bits per cell. Increasing the number of
grid points in the z direction to 128 increases
the size of the required cell memory to 1584 words
or about 50K bits. Problem #1 fits nicely into a
16K bit memory and problems #2 and #3 are good
fits to a 64K bit memory per cell.


Overview: Relevance to Parallel Processing 


Far fewer parallel computers have been built
than sequential computers because of organizational
complexity and economic barriers in the past.
Processors were expensive, and iteration of compo-
nents implied iterated costs, plus control over-
head. Recent LSI advances have removed the
economic barrier for component iteration; once
design and set-up are paid for, chip reproduction
cost is vanishingly small. There has been a
corresponding scarcity of parallel processing
research in the computer science literature,
though recently such publications are increasing.
Let us now briefly survey the parallel processing
literature, relating important results to the
proposed Navier-Stokes computer design. The short
bibliography here points to major sources ranging
over a broad spectrum of research areas, leading
to most of the important work which has been
carried out since the sixties. Research in paral-
lel processing will be divided into the following
areas: construction of parallel computers, pro-
gramming languages for parallel computers, paral-
lel evaluation of ordinary arithmetic expressions,
parallel numerical algorithms and parallel gram-
mars in formal language theory. These are dis-
cussed in order below.

The cellular design proposed here shares
some characteristics with several existing paral-
lel machines but differs in ways which suggest
far greater computing power for the intended
application. The existing machines, compared
below with our design, are described in detail in
[26] and [27].

ILLIAC IV can be viewed as an 8 x 8 array of
cells (PE's) arranged in a grid with communication
with nearest neighbors and control and I/O busses.
Each PE is a rather sophisticated general purpose
computer with seven working registers and 2K 64-
bit words of memory. Our cells, in contrast, con-
tain far less memory, smaller words, and more
limited processing capability, retaining only
enough computation power to evaluate some finite
difference expressions and communicate with near-
est neighbors. This permits LSI construction of
each PE on a single microprocessor chip. This
has important consequences which make the design
potentially thousands of times as effective as
ILLIAC IV. Reduced cost of components, accel-
erated by chip replication, permits a design
with as many cells as mesh points in the physical
problem. Instead of a potential 64-fold speedup,
ratios in the range of several thousand to one,
relative to sequential processing, are quite
feasible. More important than the magnification
of scale is the fact that an 8 x 8 mesh is insuf-
ficient for most PDE problems; therefore much of
the 64-fold potential speedup of ILLIAC IV is lost
in overhead consisting of rearranging data to fit
in the small grid. This overhead is non-existent
for our design since the computation grid corres-
ponds exactly to the problem mesh. By restrict-
ing as much of the communication as possible to
nearest neighbors, i.e., avoiding long-distance
high-bandwidth data transfers, effective execution
rates of 5,000 to one million MIPS (millions of
instructions per second) are attainable with
present technology.

At the opposite end of the PE quantity-com-
plexity tradeoffs are associative processors such
as PEPE and the Goodyear STARAN. Far more cells,
but each with lesser capability, are used.
STARAN, for example, contains 32 memory arrays of
256 words (256 bits each). Though the communica-
tions paths are not comparable with ILLIAC IV, if
the 8K words are regarded as cells, each is capa-
ble of a few simple logic operations between its
contents and those of a word to be matched.
Arithmetic must be synthesized from these opera-
tions via finite state logic in order to be car-
ried out in parallel. Parallelism is achieved by
having cells respond according to their contents
(associatively) rather than by address. The
individual words do not have enough processing
power to carry out the computations required for
mesh cells in reasonable PDE problems.

Taxonomies for parallel (or other sequential) 
computers simplify comparisons between designs 
by abstracting essential features. Choices of 
what constitute essential features are moot and 
can cause Procrustian classifications, particu- 
larly when a new design such as the one proposed 
is involved, A popular taxonomy ts the division 
of instructions (I) and data (D) into single (S) 
or multiple (M) streams, For example, a pipeline 
machine such as the CDC STAR-100 or Texas Instru- 
ments ASC is classified as MISD since one data 
stream is passing though an "assembly line" of 


processors. Many parallel and vector machines, 
including ILLIAC IV, are classtfied as SIMD be= 
cause individual PE's contain different data, but 
all carry out the same instruction sent by a con- 
troller. While our design ts clearly MD, it could 
be regarded as either SI or MI, depending on how 
closely one looks at program control architecture, 
One alternative ts to allow each chip to store a 
few simple programs, such as the sequences of 
instructions for boundary condition computations, 
interior computations, and the null computation of 
cells outside the regfon. This view yields an | 
MIMD classification though the number of distin- 
guishable instruction streams is small if cells 
are viewed as types (i.e., boundary vs. interior) 
and large if viewed as individuals. All cells 
ought to be synchronized to the same clock to pre- 
serve integrity of the physical time modelled by 
the computation system. If the clock is regarded 
as the controller, the machine should be classified
SIMD. Incidentally, we specifically avoid
the machine organization of hanging multiple 
processors on the same bus as embodied in 
Carnegie-Mellon’s C.mmp [3]. That is, though the 
clock bus may have a wide enough bandwidth (say 8 
bits) to issue general orders to all cells, it is 
never to be used for addressing individual cells 
nor intercell communication. This eliminates
address conflict and dataflow bottlenecks (which
rapidly worsen as the number of cells increases)
but raises important questions about I/O. Loading
programs and data into the cells could be done by
feeding contents sequentially into one end of the
array and using the nearest neighbor data paths to
treat the entire array as a shift register. While
this is far slower than the computation this machine is
designed for, most of this machine's activity in
PDE applications is very heavily compute-bound
rather than I/O-bound, and the inconvenience is a
small price to pay at the beginning and end of
very long sequences of computations. The output
problem is more important than input when sequences
of glimpses at the physical system are
desired, for example in displaying details of the
transition from laminar flow to turbulence.
Technological solutions such as LED output or false
color video interpretation of register contents
may be possible.

Despite the difference between our design and 
existing machines, there are many hardware 
implementation details in the latter that are the 
product of decades of experience and expertise;
these are worthy of thorough study for application 
in our design. 

Hardware organization has a large impact on
software; a clear programming language is essential
if this machine is to be used effectively.
High level languages for array machines [13, 8]
have had to deal with the problem of arranging
vectors or matrices so that FORTRAN-like arithmetic
expressions can be simultaneously evaluated
by the parallel processors of a particular system.
Looping and indexing required sophisticated analysis
(see SIGPLAN conference proceedings at GISS in
bibliography) compounded with the problems of
compiling a high level language such as FORTRAN.
Since no rearrangement of cells occurs in our
design, and communication is only between nearest
neighbors, indexing can be avoided entirely. The
bookkeeping for sequencing through an array has 
been done during the fabrication of the array of 
chips. The subscripts in the finite difference 
equations are replaced by the labels of a few 
nearest neighbor cells. The simplicity of chip
operations and stereotypy of the calculations
mean that we can get by with a simple assembly-type
language for describing the behavior
(program) of a cell. The special purpose mission
of this machine obviates the need for a general 
purpose high level language, thus saving consider- 
able time and manpower in software development. 

An area of parallel processing research which
does not appear very applicable to our design is
the parallel evaluation of ordinary arithmetic
expressions [16, 29]. Basically, operands are
grouped by pairs (since operators are binary) for
simultaneous processing, pairing the results until
a single number is left. Nesting and operator
hierarchies introduce complications, but the overall
idea is that the binary joining yields a tree
structure whose depth (number of time steps) is
$\log_2$ of the number of operands (m) in the original
expression. For the relatively short expressions
(small number of operands) in the finite difference
equations, the $\log_2(m)$ speedup does not
appear to be great enough to pay for the overhead
of finding the appropriate tree and arranging the
data in it. However, cell programs should embody
efficient arrangement of calculations.
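
The pairwise grouping described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `tree_reduce` is hypothetical, addition stands in for any binary operator, and nesting and operator hierarchies are ignored. Each pass of the loop corresponds to one parallel time step.

```python
from operator import add


def tree_reduce(operands, op=add):
    """Combine a non-empty list of operands pairwise, level by level.

    Returns (result, levels), where levels is the number of parallel
    time steps, i.e. the depth of the binary tree.
    """
    values = list(operands)
    levels = 0
    while len(values) > 1:
        # All pairs at one level could be processed simultaneously.
        paired = [op(values[i], values[i + 1])
                  for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:
            # An unpaired operand waits for the next level.
            paired.append(values[-1])
        values = paired
        levels += 1
    return values[0], levels


total, steps = tree_reduce([1, 2, 3, 4, 5])
print(total, steps)
```

For m operands the depth is the ceiling of log2(m), so a five-operand expression needs three steps instead of four serial ones, which is the modest gain the text refers to.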

Research in parallel numerical algorithms has 
been intensively pursued, even prior to the like- 
lihood of implementation [17]. Many publications 
concern matrix methods for the solution of simul- 
taneous finite difference equations on grids [28]. 
Most of the results are not applicable to our 
proposed design because the row-column grids of
the matrices in those methods constitute a very
different representation of the problem mesh than
our array of cells. That is, such matrices really
correspond closely to a graph connectivity matrix
of the system, where mesh points and communication
paths correspond to nodes and arcs respectively.
Local interaction corresponds to a sparse matrix
whose number of rows (columns) is the number of
mesh points. Thus, for an $n \times n$ mesh, the matrix
is $n^2 \times n^2$ and its topology is totally different from
that of the mesh, depending strongly on the labelling
of rows and columns. For our 100 x 100 mesh,
the matrix is 10,000 x 10,000; it is inconceivable
that such a matrix be represented by one chip per
element. Even if it were, sparseness implies
wasted hardware and the matrix simplification
methods in the literature [4] would require data
transfers over large distances by pivoting,
transpose and the like. Of course, the algorithms in
the literature do not involve such inefficient
data structures. Sparseness is exploited to yield
a few vectors of length $n^2$ which get rearranged to
improve independence of operations and hence
parallelism. In sharp contrast, our design is a
relaxation machine and data rearrangement is
avoided as much as possible. Nevertheless, the
ingenious effort which has been expended in direct
methods warrants intensive study for applications
here.
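
The size and sparseness argument above can be checked with a short calculation. The sketch below assumes, purely for illustration, that each mesh point interacts with itself and its four nearest neighbors (a five-point coupling); the function name `mesh_matrix_stats` is hypothetical.

```python
def mesh_matrix_stats(n):
    """Return (mesh points, matrix entries, nonzero entries) for an n x n mesh.

    Assumes each point couples to itself and to its existing
    north, south, east, and west neighbors.
    """
    points = n * n
    entries = points * points
    nonzeros = 0
    for row in range(n):
        for col in range(n):
            neighbors = sum(
                1
                for d_row, d_col in ((1, 0), (-1, 0), (0, 1), (0, -1))
                if 0 <= row + d_row < n and 0 <= col + d_col < n
            )
            nonzeros += 1 + neighbors  # the point itself plus its neighbors
    return points, entries, nonzeros


points, entries, nonzeros = mesh_matrix_stats(100)
print(points, entries, nonzeros)
```

Under this assumption the 100 x 100 mesh gives a 10,000 x 10,000 matrix with fewer than 50,000 nonzero entries out of 100 million, which is why one chip per matrix element would be almost entirely wasted hardware.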

Stone's observations [25] on parallel versus
serial numerical algorithms are important enough
to mention here before concluding. He notes that
efficient serial algorithms may be totally unlike
efficient parallel algorithms, that many
apparently inherently serial algorithms in fact
are not, that numerical convergence rates differ
in serial versus parallel algorithms, and that
arrangement of data structures is much more important
in parallel processing than serial. The last
point results from the fact that random access in
serial processing makes all locations equally
accessible; random access between all processors
in parallel is ruled out because of access conflicts
or interconnection structures whose complexity
grows as the square of the number of
elements. Hence communication is restricted to,
for example, nearest neighbors. These observations
suggest that much of what we know about
serial processing is not applicable to parallel
processing. This idea is confirmed on a fundamental
level by recent developments in the theory of
formal languages [11]. Study of formal systems 
yields principles whose value in applications 
includes determining algorithm design, programming 
language principles, hardware complexity trade- 
offs, and establishing ultimate limitations on 
computation power [24]. Recent results in parallel
grammars indicate that many problems which are
high on the complexity scale when algorithms are
stated sequentially are much simpler when parallel
operations are permitted. Detailed investigation
of a broad spectrum of formal languages finds this
to be the general case rather than an exception 
[19]. Immediate applications include the design 
of parallel programming languages. The powerful, 
unexpected new results characteristic of this 
field suggest long range applications in the 
design of future parallel computers.


CONTROLLING THE ACTIVE/INACTIVE STATUS OF SIMD MACHINE PROCESSORS


Howard Jay Siegel 
School of Electrical Engineering 
Purdue University, West Lafayette, IN 47907 


Masking schemes are used to control the
active/inactive status of each of the processing
elements (PE's) in an SIMD machine. Three schemes -
data conditional masks, general masks, and PE
address masks - are analyzed in terms of hardware
and software implementations, time and space
requirements, ability to activate arbitrary sets of
PE's, and ease of programmer use.

The PE address masking scheme uses an m-position
mask to specify which of $N = 2^m$ PE's are to be
activated. Each position of the mask will contain
either a 0, 1, or X ("don't care") and the only
PE's that will be active are those whose address
matches the mask: 0 matches 0, 1 matches 1, and
either 0 or 1 matches X. Superscripts are repetition
factors; square brackets denote a mask. The
structure of these masks allows them to perform
tasks such as activating the following sets of
PE's: even numbered PE's - $[X^{m-1}0]$; the $2^i$ PE's
beginning with J, where $J \geq 2^i$ and
$J = j_{m-1} \ldots j_i 0^i$ - $[j_{m-1} \ldots j_i X^i]$; every $2^i$th
PE beginning with J, where $J < 2^i$ and
$J = j_{i-1} \ldots j_1 j_0$ - $[X^{m-i} j_{i-1} \ldots j_1 j_0]$.

A mask may accompany each instruction, or may
be executed whenever a change in the active status
of the PE's is required. Consider the task of
activating an arbitrary set of PE's with a set of
these masks. For example, if N = 8, and only PE 0
and PE 7 are to execute instruction A, A must be
executed with [000] and then with [111]. The set
of masks is required to be of minimum size and
each PE to execute a given instruction must
execute it exactly once.

Theorem: The lower and upper bounds on the size
of the set of masks necessary to activate an
arbitrary set of PE's are N/2.

Proof: Lower bound: Let J be the set of PE
addresses whose binary representations each contain
an odd number of ones. If a mask has an X
in its ith position, then it will activate two
PE's whose addresses are identical except for their
ith bit position, i.e., one address will have an
odd number of ones, the other even. Thus, for each
address in J, whose size is N/2, a separate mask
will be required. Upper bound: Consider the
arbitrary pair of PE's $y_{m-1} \ldots y_2 y_1 0$ and
$y_{m-1} \ldots y_2 y_1 1$. If both are to be activated use
$[y_{m-1} \ldots y_2 y_1 X]$, if only the former use
$[y_{m-1} \ldots y_2 y_1 0]$, if only the latter use
$[y_{m-1} \ldots y_2 y_1 1]$, and if neither do nothing.

This is a summary of Purdue TR-EE 77-25, supported
by NSF Grant DCR 74-21939 at the Princeton

It is most likely that sets of PE's such
as J above will be anomalies, but this is highly
user dependent. By using a modified Karnaugh map
procedure the smallest set of masks necessary to
activate a given set of PE's can be computed.
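
The matching rule for regular PE address masks can be sketched in a few lines. This is an illustrative software model only, not the hardware or the procedure of the full report; masks are written as strings over 0, 1, and X with the most significant position first, and the names `active_pes` and `odd_parity_addresses` are hypothetical.

```python
def active_pes(mask):
    """Return the sorted PE addresses matched by a regular PE address mask.

    The mask is a string over '0', '1', 'X' of length m; addresses run
    from 0 to 2**m - 1 and are compared bit by bit.
    """
    if not mask or set(mask) - set("01X"):
        raise ValueError("mask must be a non-empty string over 0, 1, X")
    m = len(mask)
    active = []
    for address in range(2 ** m):
        bits = format(address, "0{}b".format(m))
        if all(p == "X" or p == b for p, b in zip(mask, bits)):
            active.append(address)
    return active


def odd_parity_addresses(m):
    """The set J of the lower-bound proof: addresses with an odd number of ones."""
    return [a for a in range(2 ** m) if bin(a).count("1") % 2 == 1]


print(active_pes("000"), active_pes("111"))  # PE 0 and PE 7, as in the N = 8 example
print(active_pes("XX0"))                     # even numbered PE's
print(odd_parity_addresses(3))               # N/2 addresses, one mask each
```

Any mask containing an X matches addresses of both parities, so none of the addresses printed on the last line can share a mask with another; this is the worst case of N/2 masks.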

A negative PE address mask is the same as a
regular PE address mask, except it activates all
those PE's which do not match the mask. Negative
masks are prefixed with a minus sign. For
example, $[-1^m]$ activates all PE's except PE $1^m$, a
task which would require m regular PE address
masks. In most, but not all, cases the combination
of negative and regular masks is better than
regular masks alone.

Theorem: The lower and upper bounds on the size 
of the set of masks, consisting of both types of 
masks, necessary to activate an arbitrary set of 
PE's are N/2. 

Proof: Similar to previous theorem. 

To quantify the additional power negative
masks contribute, consider the number of distinct
masks formable from these two schemes,
where two masks are considered to be distinct if
they activate different sets of PE's.

Theorem: The number of distinct masks formable
from the set of all possible regular and negative
PE address masks is $2(3^m - m)$.

Proof: The regular PE address masks form $3^m$
distinct masks. The mask $[-X^m]$ does not activate
any PE's. A negative mask with m-1 X's
has an equivalent regular mask, e.g.
$[-X^{m-1}1] = [X^{m-1}0]$. Each negative mask with fewer than
m-1 X's is distinct from any regular mask. There
are $3^m - (2m+1)$ such masks.
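
The count in the theorem can be confirmed by brute force for small m. The sketch below is an independent check rather than part of the proof: it enumerates every regular mask and its negative counterpart and counts the different sets of PE's they activate. The function name `count_distinct_masks` is hypothetical.

```python
from itertools import product


def count_distinct_masks(m):
    """Count distinct activated PE sets over all regular and negative masks."""
    n_pes = 2 ** m
    all_pes = frozenset(range(n_pes))
    activated_sets = set()
    for mask in product("01X", repeat=m):
        matched = frozenset(
            address
            for address in range(n_pes)
            if all(p == "X" or p == b
                   for p, b in zip(mask, format(address, "0{}b".format(m))))
        )
        activated_sets.add(matched)            # regular mask
        activated_sets.add(all_pes - matched)  # negative mask
    return len(activated_sets)


for m in range(1, 5):
    print(m, count_distinct_masks(m), 2 * (3 ** m - m))
```

The two printed counts agree for each m tried, with the empty set activated by the all-X negative mask counted as one distinct mask.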

Let n be the number of X's in a mask. A regular
mask activates $2^n$ PE's and a negative mask
activates $N - 2^n$ PE's.

If two bits are used to represent each mask
position, then a fourth symbol, in addition to
0, 1, and X, could be used. For example,
"S" could mean "same as the bit to the right," so
that [1SX] would activate PE's 4 and 7, or "D"
could mean "different from the bit to the right,"
so that [DDX] would activate PE's 2 and 5.
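
The two candidate fourth symbols can be modeled in software as follows. This sketch accepts both S and D in one function for convenience, although the text proposes them as alternatives; rejecting S or D in the rightmost position is an assumption made here, since that position has no bit to its right. The name `active_pes_extended` is hypothetical.

```python
def active_pes_extended(mask):
    """Return PE addresses matched by a mask over '0', '1', 'X', 'S', 'D'.

    'S' matches when the address bit equals the bit to its right;
    'D' matches when it differs from the bit to its right.
    """
    m = len(mask)
    if m == 0 or set(mask) - set("01XSD"):
        raise ValueError("mask must be a non-empty string over 0, 1, X, S, D")
    if mask[-1] in "SD":
        raise ValueError("rightmost position has no bit to its right")
    active = []
    for address in range(2 ** m):
        bits = format(address, "0{}b".format(m))
        matches = True
        for i, symbol in enumerate(mask):
            if symbol in "01":
                matches = bits[i] == symbol
            elif symbol == "S":
                matches = bits[i] == bits[i + 1]
            elif symbol == "D":
                matches = bits[i] != bits[i + 1]
            # 'X' matches either bit value.
            if not matches:
                break
        if matches:
            active.append(address)
    return active


print(active_pes_extended("1SX"))
print(active_pes_extended("DDX"))
```

The two calls reproduce the examples in the text: PE's 4 and 7 for [1SX], and PE's 2 and 5 for [DDX].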

The PE address masking scheme and its variations
present a clear and concise notation for
masking. If we assume each PE knows its own
address then, using data conditional statements 
for software decoding, the notation of PE address 
masks could be implemented without additional 
hardware costs. 

The analyses and comparisons presented in the 
full paper aid the machine designer in choosing
a masking system and a method to implement it. 


ARCHITECTURAL DESIGN CONSIDERATIONS FOR A
FAULT-TOLERANT ARRAY PROCESSING SYSTEM (a)


Alexander Thomasian (b) and Algirdas Avizienis
Computer Science Department
University of California, Los Angeles
Los Angeles, California 90024


Summary 


A large number of architectural issues should
be resolved in designing parallel processing systems
for large scale numerical computations. We
discuss here the approach adopted in a particular
case, the design study of an array processing system
called the Shared Computing Resource - SCR
[1], [2].


Realization of high computational capacity
for array processing. High computational capacity 
is achieved in the SCR system by means of two ar- 
rays of homogeneous units. An array of multifunc- 
tional, high-speed arithmetic processors is pro- 
vided to perform arithmetic operations on large 
arrays of data. An array of address generators 
handles the fetching and storing of array operands 
residing in a large, high-bandwidth memory. 


Realization of high performance for array
processing. The evaluation of expressions involv- 
ing array operands is speeded up in the SCR system 
by allocating several arithmetic processors and 
address generators to the computation. For example,
the evaluation of the vector expression
A ← B × C + D requires two arithmetic processors and
four address generators. The evaluation of vector
expressions can be speeded up in the SCR system by
applying segmentation, which consists of breaking
down long computations into segments which are
then distributed among the units. Because of the
additional setup time overhead incurred when this
approach is applied, task segmentation is performed
considering the length of the vector operands
involved in a computation and the availability
of other tasks to be executed in the system.
This leads to the space-sharing approach, where 
several computations can execute concurrently in 
the SCR system. 
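
Segmentation of a vector expression can be illustrated with a small software sketch. It is not a model of the SCR hardware: the function names `segment_bounds` and `evaluate_segmented` are hypothetical, the segments are simply evaluated one after another, and setup overhead is not represented.

```python
def segment_bounds(length, units):
    """Split range(length) into at most `units` contiguous, near-equal segments."""
    units = max(1, min(units, length))
    base, extra = divmod(length, units)
    bounds, start = [], 0
    for k in range(units):
        stop = start + base + (1 if k < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def evaluate_segmented(b, c, d, units):
    """Evaluate A <- B x C + D segment by segment."""
    a = [0] * len(b)
    for start, stop in segment_bounds(len(b), units):
        # In a segmented system each segment would be assigned to its
        # own group of arithmetic processors and address generators.
        for i in range(start, stop):
            a[i] = b[i] * c[i] + d[i]
    return a


print(segment_bounds(10, 3))
print(evaluate_segmented([1, 2, 3, 4], [5, 6, 7, 8], [1, 1, 1, 1], units=2))
```

The result is the same for any number of segments; only the distribution of work changes, which is why the choice can be left to the scheduler according to vector length and the other tasks waiting.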


Scheduling of computations. A centralized
scheduling unit performs the scheduling of compu- 
tations in the SCR system. The scheduler keeps 
track of the status of the SCR units and initiates 
a computation when the resources required for the 
execution of a computation become available. When 
a computation is scheduled for execution, a con- 
trol unit sets up the assigned arithmetic proces- 
sors and address generators, as well as the data 
transmission paths among them. The assigned units
then proceed autonomously with the computation and
a very simple intercommunication scheme is
required to coordinate their operation. The
scheduling and setup overhead is acceptable in
the SCR system, since it is prorated over a large
number of array elements.


(a) This research was sponsored by the National


Continuous availability. Continuous availability
is achieved in the SCR system by applying
the pooling approach, such that computations are 
assigned dynamically to the SCR units. It is assumed
that the arithmetic processors and address
generators have built-in checking capability to
detect hardware failures. When a unit fails, the 
computation whose execution was suspended is re- 
enqueued for execution. A prerequisite of this 
scheme is that each computation can only request 
a subset of the SCR units. 


Memory bandwidth utilization. Due to the
high computational capacity associated with mul- 
tiple functional units, the performance of the 
system might be constrained by the bandwidth of 
the memory holding the array operands. The per- 
formance of the SCR system can hence be increased 
by providing an interconnection network among the 
arithmetic processors. The data-flow graphs of 
array computations are then mapped into this in- 
terconnection structure by setting up the 
configuration required by the computation. 


Conditional processing of vector operands.
At a small additional cost in hardware complexity 
and making use of the multiplicity of functional 
units in the SCR system, efficient processing of 
conditional expressions with vector operands can 
be realized. More generally, the SCR system can 
be shown to be suitable for the direct implementation
of most array processing operators in APL.


PEPE HARDWARE AND SYSTEM OVERVIEW


Alf John Evensen 
System Development Corporation 
Huntsville, Alabama 35805 


Summary 


The PEPE (Parallel Element Processing 
Ensemble) hardware has been produced and is 
currently installed at the Ballistic Missile 
Defense Research Center. The Experimental 
facility includes a partial PEPE machine and the 
software system needed for coding, evaluating, 
and demonstrating experimental BMD processes on 
the machine. The PEPE operates in conjunction 
with a Burroughs B1700 Test and Maintenance 
computer and a CDC 6400/7600 host computer. 
This paper presents current details relative to 
the existing physical and performance charac- 
teristics which are shown in Tables 1 and 2. 
PEPE is configured as shown in Figure 1.


Table 1. PEPE Element Bay Characteristics.


Partitioning                4 rows with 9 elements/row (36 elements/bay)
System Capability           8 bays (288 elements)
Bay Profile                 4 rows of element boards and each row
                            provided with individual power supply
                            (9 elements)
Element                     6 boards/element
Board Dimensions            16" X 18" X .1"
Multilayer Configuration    8 copper PC layers: (S1, S2, GND, VCC,
                            VEE, GND, S3, S4)
Dual In-Line Packages       300 DIPs/board (maximum) attached by
                            means of sockets
Power Dissipation/Board     130 W (average)
Power Dissipation/Element   20 kW
Power Supply Dimensions     18" X 17.75" X 26"
5.2 Volt Supply Load        560 A/supply
2.0 Volt Supply Load        544 A/supply
Dimensions                  84" Height X 82" Width X 30" Deep
Cooling                     Chilled-water heat exchanger, with forced-
                            air dual blower


Table 2. Control Console Characteristics 


PC Boards              3 Rows of 54 Boards/row
Board Partitioning     10 ACU
                       CCU
                       AOCU
                       ACU Memory
                       Input/Output Units
                       EMC-ODC
                       ICL
                       MCDU
                       Spare Positions
                       Door Mounted Signal Distribution
                       System Boards
Board Configuration    6 Layers (S1, GND, VCC, VEE, GND, S2)
                       Plus Wire Wrap
Power Dissipation      7 kW
Dimensions             84" Height X 82" Width X 30" Deep
Cooling                Chilled Water Heat Exchanger With Forced-
                       air Dual Blower


Figure 1. PEPE Functional Configuration
(block diagram not reproduced; surviving labels: CPU (CDC 7600), SCM buffers)




NUMERICAL WEATHER PREDICTION IN 
THE PEPE PARALLEL PROCESSOR 


Howard O. Welch 
System Development Corporation 
Huntsville, Alabama 35805 


Abstract -- The mapping of a generic numeri- 
cal weather prediction model onto the PEPE paral- 
lel associative array processor is described in 
this paper. The case study demonstrates that the 
PEPE array processor can apply significantly 
greater computational power to the problem than 
can be achieved with available conventional 
computer architectures. The paper describes PEPE 
architecture, mapping of the finite difference 
approximations onto the array, conversion from 
Fortran to Parallel Fortran, run time measure- 
ments, and comparative results. 


The study shows that PEPE architecture is 
well matched to finite difference approximations 
for the solution of partial differential equa- 
tions even though there is no hardware provision 
for parallel interelement data transfer, provided 
the finite difference mesh is fairly large. 


Introduction 


The PEPE processor was designed expressly to 
handle enormous real time data processing loads 
of the types encountered in ballistic missile 
defense (BMD) applications. PEPE was designed as 
an augmentation of a commercial serial computer 
to assume that part of the BMD data processing 
load having the following characteristics: 

1) Correlation of input data with the 
existing data base by one or more 
attributes. 

2) Repetitive, highly arithmetic pro- 
cessing on a large number of indepen- 
dent data sets. 

3)  Multi-leveled ordering and search of a 
large, complicated data base. 


The PEPE computer architecture, a parallel 
array with three independent processors per array 
element and associatively addressed operand 
memory, is well suited to this data processing 
problem [1]. The BMD requirements for radar 
return correlation, digital filtering, tracking 
and radar resource allocation are solved in the 
three independent PEPE processing units described 
in the following section. 


PEPE, while designed specifically for the 
BMD problem, has a general purpose instruction 
set and the PEPE parallel Fortran language (PFOR) 
does not limit the user to any specific appli- 
cation [2]. There is therefore a reasonable 
expectation that the potentially enormous data 
processing power inherent in parallel associative 
architecture might be applied to other problems 
which currently severely stress or are beyond the
capability of the most powerful commercially
available computers. 


The second order partial differential equa- 
tions of fluid flow do not, in general, have 
analytic solutions and hence are solved only with 
numerical methods which require great computa- 
tional power. Of these equations, those describ- 
ing hydrodynamic flow, global weather models, and 
magnetohydrodynamic models are of current
interest [3][4][5].


Parallel computer architectures expressly 
designed for solution of partial differential 
equations have featured provision for interelement 
data transfer to solve the finite difference 
representations of the equations. This paper 
shows that PEPE associatively addressed array 
elements suffice for interelement data transfer 
given a large finite difference lattice; and 
furthermore that the PEPE unstructured array
architecture provides certain advantages over a 
machine with direct hardware interelement communi- 
cations. 


SDC has selected the global weather modeling 
problem as representative of the general class of 
partial differential equations and has coded and 
executed a benchmark supplied to SDC by the 
Geophysical Fluid Dynamics Laboratory, Princeton 
University. This paper describes the implementa- 
tion and results of the benchmark effort. 


PEPE Architecture 


PEPE (Figure 1) consists of an ensemble of 
independent digital processing elements, indefi- 
nite in number, which operate in parallel under 
global control. The current 288 element PEPE 
configuration [1] has three modules, each con- 
sisting of an independent global control driving 
an associated processing unit in each PEPE pro- 
cessing element. The three processors in the 
element share a common element data memory for 
parallel operand storage. The three modules are 
optimized for correlative data base search and 
data input, floating point arithmetic and associa- 
tive data base search and output respectively, 
reflecting a design response to the BMD data 
processing problem. 


Each control unit has a data memory and a 
program memory independent of the other global 
control units. The global control units have a 
limited data processing capability, the instruc- 
tion repertory being limited to logical test, 
branch, index register and input/output control, 
plus sufficient integer arithmetic to compute 
indexes and addresses. This instruction set 
serves primarily to control the parallel instruction
sequence rather than to execute code
directly related to the solution. Parallel and 
global control unit instructions are stored in 
the program memory associated with the global 
control unit. 


Figure 1. PEPE Architecture
(block diagram not reproduced; surviving labels: host computer; correlation,
arithmetic, and output control units; correlation and output units)


Parallel instructions are decoded in the 
global control units to cause a microinstruction 
sequence of control pulses to be moved from a 
wide bandwidth, read-only memory to the parallel 
instruction bus for transmission and execution in 
the processing elements. The control pulses 
cause simultaneous and identical operations in 
each of the active processing elements. 


Each processing element has three units, 
corresponding to and independently controlled by 
the three global control units by means of three 
instruction/data busses. Each processing element 
has 2048 words of memory shared by the three 
element processing units for parallel operand 
storage. 


Each of the three element processing units 
has an associated activity indicator which can be 
programmed active or inactive according to the 
contents of the unit's accumulator or other 
condition register. The parallel instruction 
stream from a control unit is routed only to 
active element processing units. A full set of 
parallel instructions, analogous to logical test 
and branch instructions, allow the programmer to 
set activity and hence to control the set or 
subset of elements which participate in proces- 
sing. The subset can be selected by data attri- 
bute rather than data location address, hence, 
the claim for associative memory in PEPE. A 
hardware extremum search algorithm is included in 
the parallel instruction repertory [6]. This 
instruction allows selection of the processing
element with the largest (or smallest) accumulator
value, providing a data ordering capability in
the PEPE hardware. 




Consider the aggregate of element memory as 
a two-dimensional MxN memory array with any 
column formed by the M words of a single element 
memory, and rows formed by the N PEPE processing 
elements. Rows are addressed conventionally by 
the parallel instruction operand address; each 
row consists of a vector whose elements are 
defined by the set of active processing elements, 
where activity is specified by the activity 
selection instructions described above. Arithme- 
tic and logical operations are performed simul- 
taneously on all vector elements so that data 
access and manipulation in the row dimension, 
i.e., across the processing elements, uses the 
same processing time for n vector elements as for 
one element in the array or for none. The PEPE 
instruction format has no direct hardware pro- 
vision for addressing any specific physical 
element and there is no provision for specifying 
any set size (except for isolation of a single 
element). 
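
The row-wise view of element memory and the activity mechanism can be mimicked in software. The sketch below is a loose illustration, not the PEPE instruction set: a row is a Python list with one entry per element, and the names `select_active`, `parallel_row_op`, and `extremum_element` are hypothetical.

```python
def select_active(accumulators, condition):
    """Set each element's activity from its own data (associative selection)."""
    return [condition(value) for value in accumulators]


def parallel_row_op(row, active, operation):
    """Apply the same operation in every active element; inactive ones are unchanged."""
    return [operation(x) if on else x for x, on in zip(row, active)]


def extremum_element(accumulators, active, largest=True):
    """Index of the active element with the largest (or smallest) accumulator."""
    candidates = [i for i, on in enumerate(active) if on]
    if not candidates:
        return None
    choose = max if largest else min
    return choose(candidates, key=lambda i: accumulators[i])


row = [3, -1, 4, -5]
negative = select_active(row, lambda v: v < 0)
print(negative)
print(parallel_row_op(row, negative, lambda x: -x))
print(extremum_element(row, [True] * len(row)))
```

No element is named by its physical position in the first two calls; the set that participates is determined entirely by the data each element holds.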


Direct communication between two physically 
adjacent PEPE elements is not provided, reflec- 
ting the BMD origins of PEPE architecture. The 
only mechanism available for interelement data 
transfer is to move data from a given element to 
the global control unit and then back to other 
elements. This process is serial in nature and 
hence slower than purely parallel operations in 
PEPE. 


GFDL Benchmark 


The Geophysical Fluid Dynamics Laboratory 
(GFDL) benchmark is a program for evaluating 
large computer performance. The benchmark 
resembles currently used atmospheric simulation 
models but does not contain a complete set of 
consistent atmospheric processes. Its purpose 
is the exercise of the computer rather than a 
physical simulation. 


Physical processes modeled in the bench- 
mark include horizontal and vertical advection, 
earth's rotational effects (Coriolis Force), 
horizontal pressure gradient, non-linear horizon- 
tal viscosity, and diffusion, and heat convec- 
tion. Excluded physical processes include 
vertical diffusion, a hydrologic cycle, radiation 
and surface effects. See Appendix A for the 
differential equations governing the model. 


The prediction domain is the Northern and 
Southern hemispheres, each projected on a polar 
stereographic plane tangent at the pole. Pro- 
vision is made to confine flow within a hemis- 
phere. The vertical domain is divided into 
nine spatially unequal layers defined in a 
normalized pressure (sigma) coordinate system. 
See Figure 2. 


At each time step, the tendencies of the 
prediction variables are evaluated in a vertical 
slice of nine rows which form a plane of the 
finite difference lattice. The total length of 
each row is 161 points, although computations 
are carried out at only a limited number of
points, the number being a function of latitude
and averaging 125 over a single time step.


i - Longitudinal Index 
j - Latitudinal Index
k - Altitudinal Index 


Figure 2. Finite Difference Lattice 


PEPE Implementation 


General Characteristics 


Time is advanced in the model by the equation

$$a^{t+1} = a^{t-1} + 2\Delta t \, \frac{\partial a}{\partial t} \quad \text{for all } t > 0$$

where $a$ represents any prediction variable and
the superscript refers to time.
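
The time advance can be written as a one-line update applied to every lattice point. The sketch below assumes the centered (leapfrog) form, in which the new value is the value two steps back plus twice the time step times the current tendency; the function name `leapfrog_step` and the sample numbers are illustrative only.

```python
def leapfrog_step(a_previous, tendency, dt):
    """Advance each prediction variable from time t-1 to time t+1.

    a_previous holds the values at time t-1 and tendency holds the
    time derivative evaluated from the time t data.
    """
    return [a + 2.0 * dt * rate for a, rate in zip(a_previous, tendency)]


a_next = leapfrog_step([1.0, 2.0], [0.5, -1.0], dt=0.1)
print(a_next)
```

Because the update needs both the t-1 and the t values, two complete sets of data must be available for every column in addition to the new set being produced.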


The finite difference representations of
the equations take the form

$$\left(\frac{\partial a}{\partial t}\right)_{ijk} = f\left(a^{t}_{ijk},\ a^{t}_{i \pm 1,jk},\ \ldots\right)$$

where the subscripts refer to spatial position
in the finite difference lattice.


Data for the finite difference lattice are 
stored in columns, 9 points deep. The prediction
variables require 39 single precision
computer words of storage, hence the lattice of
161 x 125 columns requires about 785,000 words
of storage. The integration algorithm requires 
space for three complete sets of these data for 
a total of 2.35 million computer words. This 
storage requirement exceeds the capacity of 
PEPE element memory so that mass storage, 
accessed through the CDC 7600 host, is used. 
The time update is performed one lattice plane 
at a time. A plane represents all points at a 
given latitude and may vary from 25 to 161 
points. Each column of a plane is evaluated in 
a PEPE element so that a complete plane is 
processed as vectors of the prediction variables. 
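
The storage figures quoted above follow from simple arithmetic, reproduced here as a check. All numbers are taken from the text (161 x 125 columns, 39 words per column, three data sets, 288 elements of 2048 words each); the variable names are illustrative.

```python
LATTICE_COLUMNS = 161 * 125   # columns in the finite difference lattice
WORDS_PER_COLUMN = 39         # single precision words per column
DATA_SETS = 3                 # complete sets kept by the integration algorithm
ELEMENTS = 288                # current PEPE configuration
WORDS_PER_ELEMENT = 2048      # element memory size

lattice_words = LATTICE_COLUMNS * WORDS_PER_COLUMN
total_words = DATA_SETS * lattice_words
element_memory_words = ELEMENTS * WORDS_PER_ELEMENT

print(lattice_words)          # about 785,000
print(total_words)            # about 2.35 million
print(element_memory_words)   # total element memory, for comparison
```

The total requirement is roughly four times the aggregate element memory, which is why mass storage on the host is needed and the update proceeds one lattice plane at a time.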


The PEPE correlation unit is used as the 
input device. Data are read from mass storage 
by the host, formatted into messages and transmitted
to PEPE. Two messages per lattice
column are required, one for the time step t - 1
values and one for the time step t values.
Since time t data from the adjacent lattice
columns are required for update of any column,
the correlation unit stores data from a given 
column into the two logically adjacent elements. 
It is not necessary that data from adjacent 
lattice columns be stored in physically adjacent 
PEPE elements since the associative addressing 
of elements suffices to find logical adjacencies. 


Adjacencies in latitude and altitude are
accessible directly in element memory and hence
do not require the redundant data storage described
above for the longitudinal adjacencies.


Programs 


The PEPE implementation of the GFDL benchmark
was designed according to the following
criteria:

1) Provide maximum vector lengths for
the predictions for maximum execution
time reduction over the serial implementation.

2) Provide maximum overlap of correlation
unit and associative output unit
operations with arithmetic unit
operations.

3) No changes to the algorithm are made,
i.e., the data are processed in the
same order and with the same arithmetic
operations as in the serial implementation.


The code conversion for PEPE implementation 
falls into 6 categories: 


1) FORTRAN subroutines essentially 
unchanged. 

2) FORTRAN subroutines substantially 
modified. 

3) Subroutines rewritten into PFOR, 
either retaining their distinct 
identity or incorporated into other 
subroutines. 

4) New FORTRAN code. 

5) New PFOR code. 

6) Deleted subroutines. 


The serial implementation consists of 33 
code segments (subroutines plus the main program) 
with 631 lines of FORTRAN code excluding data 
descriptors and nonoperable statements. The 
PEPE implementation consists of 23 code segments 
with 854 lines of code. Of this code, 229 
lines are FORTRAN, executed in the host and 625 
lines are PFOR executed in PEPE. 


Of the host code, three subroutines are 
directly taken from the serial implementation. 
Two of them, DUMDAT and CONST are essentially 
unchanged while the third, GFDL, is approxi- 
mately 50% rewritten and radically restructured 
to interface with the PEPE/7600 Real Time 
Executive. 


The PEPE or PFOR code consists of 17 code 
segments. Twelve of these segments, totaling 
519 lines of code, are direct translations to 
PFOR of 30 FORTRAN subroutines totaling 481 
statements. Many small subroutines were 
incorporated into the calling code segment, 
either because the subroutine was specific to 
the data structure of the serial implementation 
or because the number of formal call parameters 
appeared difficult to handle in the parallel 
implementation. 


Five new PFOR routines totaling 106 state- 
ments were written. Four of these provide the 
message handling code for input and output of 
data and constants while the fifth is a segment 
of the control program GFDL moved to PEPE. 


The 35% increase in the number of lines of 
code appears to have four causes. First, some 
intermediate variables are calculated three 
times, for the i - 1, i, and i + 1 indexes, in 
order to obviate any requirement for 
interelement data transfer. Second, 18 FORTRAN 
subroutines were incorporated into in-line 
code, in some cases several times. Third, the 
new subroutines to interface the host and PEPE 
represented new and unique requirements. Last, 
certain routines were highly logical and 
branched in the FORTRAN version and did not 
convert efficiently to PFOR. 


Of the 625 lines of PFOR code, approximately 
451 lines or 72% were directly transferred from 
the FORTRAN with only a change in the index 
specification of the data items. 
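
The line counts quoted in this section can be cross-checked with a few lines of arithmetic. The Python sketch below uses only figures from the text; the variable names are illustrative.

```python
# Cross-check of the code-size figures quoted in the text.
serial_lines = 631          # serial FORTRAN implementation
host_fortran_lines = 229    # PEPE implementation, executed in the host
pfor_lines = 625            # PEPE implementation, executed in PEPE

pepe_total = host_fortran_lines + pfor_lines      # 854 lines
increase = pepe_total / serial_lines - 1          # about 0.35

translated_pfor = 519       # twelve translated segments
new_pfor = 106              # five new PFOR routines
pfor_check = translated_pfor + new_pfor           # should equal 625

direct_fraction = 451 / pfor_lines                # about 0.72

print(pepe_total, round(increase * 100), pfor_check,
      round(direct_fraction * 100))
```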


Data Base Conversion. The major design 
task in converting a program from serial to 
parallel implementation is the conversion of 
the data base. PEPE has a complicated data 
memory structure in that it has element memory, 
three global data memories and, in the case of 
the CDC 7600, a large core, small core and mass 
storage. 


This conversion placed a major emphasis on 
the PEPE configuration, hence little effort was 
expended to simplify or reduce host data storage 
requirements even in cases where redundant or 
nonrequired data space was used. 


The mass storage interface was essentially 
left unchanged, except for the initial mass 
storage reads necessary to initialize PEPE at 
the start of each time step. 


Data is read from disc to the CDC 7600 
Small Core Memory (SCM) and then to Large Core 
Memory (LCM), one lattice plane at a time. The 
plane of colatitudinal lattice points 9 rus 
deep is stored in PEPE as columns with longitu- 
dinal index i distributed across the PEPE 
elements, i.e., each i entry is assigned to a 
PEPE element. The immediately neighboring 
columns, indexes i - 1 and i + 1, are also 
stored in the i element so that there will be 
no requirement for interelement data access 
during the update. Three lattice planes, 
indexes j - 1, j and j + 1, are also necessary 
to evaluate the finite difference equations. 
The time update equation requires the variables 
for point ijk at the previous time step. 


Finally, the overlapped I/O in PEPE requires 
that space be allocated for plane j + 2 and for 
the updated point at plane j - 1. This totals 
14 lattice columns stored in each PEPE element 
memory. Figure 3 illustrates the memory 
configuration. 


Figure 3. PEPE Element Memory Map 


Performance 


Measurement Methods 


The PEPE real time clock, counting at a 
5 x 10^6 Hz rate, is used for all execution time 
measurements. The clock provides measurements 
to a 200 ns granularity. There are 8 additional 
counter/timers in the PEPE arithmetic control 
unit, under program control, which can measure 
127 different PEPE hardware events or time 
intervals associated with those events. Events 
include control unit instruction execution, 
parallel instruction execution, element memory 
conflicts, etc. Timing in the eight counter/ 
timers takes place to a granularity of 100 ns. 
PEPE instructions control the reading and 
starting of the clock and counters. 


Measurement Limitations 


All timing information was collected on 
the Advanced Research Center PEPE hardware 
which has 11 processing elements. Execution of 
a problem which requires 161 elements maximum, 
125 elements average, may be timed with 
confidence on the 11 element PEPE due to the 
unstructured nature of the associative array 
architecture. The parallel instruction set 
execution time is not, in general, sensitive to 
the number of active elements in the ensemble. 
Parallel code executes in exactly the same 
time with 11 elements, 125 elements or 161 
active elements. The only exception to 
this in the GFDL benchmark is in the 
mixing ratio adjustment subroutine MRADJ 
where interelement transfer of the mixing 
ratio variable is required. The execution 
time of this subroutine is strongly 
dependent on the calculated value of mixing 
ratio at each pressure altitude point in the 
updated lattice plane. Since no valid values 
of mixing ratio are calculated, the execution 
time for MRADJ in the benchmark represents an 
absolute minimum. 


Execution Time 


Execution time is measured over a single 
lattice plane and then extrapolated to a 161 
plane time step. The lattice plane was selected 
to have 125 columns, the average number of 
columns per time step. A 125 column lattice 
plane requires execution of 72254 instructions. 
Of these, 42565 instructions are executed in 
the parallel elements and 29689 are executed in 
the control unit, with the control unit instruc- 
tions substantially time overlapped with the 
parallel instructions. The average length in 
the 161 planes is 125, hence PEPE executes 
125 x 42565 + 29689 = 5350314 instructions per 
plane. Measured time per plane is .024703 
seconds giving an effective average instruction 
execution rate on the GFDL problem of 216.5 
million instructions per second. A complete 
time step of 161 lattice planes requires 3.977183 
seconds to complete. This is a minimum time 
due to the mixing ratio adjustment calculation. 
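
The instruction counts and times in this paragraph can be reproduced directly. The Python sketch below uses only the figures quoted above (variable names are illustrative). The computed rate comes out near 216.6 million instructions per second, which the text quotes as 216.5.

```python
# Reproduce the execution-time arithmetic from the figures in the text.
control_unit_instructions = 29689
parallel_instructions = 42565
columns = 125                 # active elements for the measured plane
time_per_plane_s = 0.024703   # measured
planes_per_step = 161

# Instructions issued for one plane (control unit + parallel stream).
issued = control_unit_instructions + parallel_instructions

# Each parallel instruction is executed once per active element.
effective = columns * parallel_instructions + control_unit_instructions

rate_mips = effective / time_per_plane_s / 1e6
time_per_step_s = planes_per_step * time_per_plane_s

print(issued)                      # 72254
print(effective)                   # 5350314
print(round(rate_mips, 1))
print(round(time_per_step_s, 6))   # 3.977183
```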


Table I describes performance of various 
machines executing 257 time steps on the GFDL 
benchmark and the extrapolated time for PEPE to 
execute the same problem. The estimated PEPE 
time is a minimum time and would be larger if 
the full 161 element PEPE were available to 
allow MRADJ to run on correct data. 
Table I. Comparison Execution Times 


MACHINE                 TIME (MIN) 
IBM 360/91 
CDC 7600 
IBM 360/195 
TI ASC (FOUR PIPE) 
PEPE/CDC 7600* 


*EXTRAPOLATED TIME 


Table II provides a breakdown of executed instruc- 
tions and execution times of the subroutines in 
the benchmark. Totals differ slightly from above 
due to slight differences in measurement techni- 
que and overhead. 


Table II. Subroutine Execution Statistics 

SUBROUTINE      TIME (us) 
NEXTRW            1336.8 
HRDIF1            1067.4 
INNER1           21282.0 
UTAUP              217.1 
PSTAR              124.4 

Appendix A 


GFDL Benchmark Differential Equations 
(13) 


(14) 


(15) 


(16) 


(17) 


(18) 


(19) 
$\Delta(P_* T)_c$ = heat convection where the lapse 
rate exceeds moist adiabatic                      (20) 

$\Delta(P_* R)$ = moisture adjustment to avoid 
negative mixing ratio                             (21) 

$Q = P / P_*$                                     (22) 

u, v   GRID ORIENTED, PRESSURE WEIGHTED 
       HORIZONTAL WIND COMPONENTS 
T      TEMPERATURE 
R      MIXING RATIO (MASS OF WATER VAPOR/ 
       MASS OF DRY AIR) 
P_*    PRESSURE AT SEA LEVEL 

TIME ADVANCE 

$A_1 = A_0 + \Delta t \left(\frac{\partial A}{\partial t}\right)_0$,  t = 0 

$A_{t+1} = A_{t-1} + 2\Delta t \left(\frac{\partial A}{\partial t}\right)_t$  for all t > 0 

References 


[1] Evensen, A. J. and Troy, J. L., "Introduction 
    to the Architecture of a 288 Element PEPE," 
    Proceedings of the 1973 Sagamore Computer 
    Conference on Parallel Processing. 

[2] Dingeldine, J. R., Martin, H. G. and 
    Patterson, Wm., "Operating System and 
    Support Software for PEPE," 1973 Sagamore 
    Conference on Parallel Processing. 

[3] Gates, W. L., Batten, E. S., Kahle, A. B., 
    and Nelson, A. B., "A Documentation of the 
    Mintz-Arakawa Two Level General Circulation 
    Model," ARPA Document R-897-ARPA, December 
    1971. 

[4] Daley, J. and Underwood, B. D., "Short Term 
    Weather Prediction on the Illiac IV," 1975 
    Sagamore Computer Conference on Parallel 
    Processing. 

[5] Chapman, D. R., Mark, H. and Pirtle, M. W., 
    "Computers Vs. Wind Tunnels for Aerodynamic 
    Flow Simulations," Astronautics and 
    Aeronautics, April 1975. 

[6] DiVecchio, M. C., "Design and Implementation 
    of a High/Low Magnitude Search Instruction 
    on PEPE," 1975 Sagamore Conference on 
    Parallel Processing. 


PEPE APPLICATION TO BMD SYSTEMS 


Charles E. Blakely 
System Development Corporation 
Huntsville, Alabama 35805 


Abstract -- The PEPE Development Program, 
for the past 4-1/2 years, has been concerned 
primarily with the design and development of an 
experimental hardware and software facility for 
conducting research on parallel and associative 
data processing techniques as applied to Ballistic 
Missile Defense (BMD) service. Preliminary 
investigative work on PEPE applications started 
in January 1974 employing functional simulation 
tools. 


Concurrent with the functional simulations, 
an analytic benchmarking effort on the hardware 
was conducted. The benchmarks were primarily BMD 
routines to collect data on the hardware 
functions. 


Some results of the PEPE benchmarking study 
for a Kalman Filter are presented and discussed 
in the paper. The PEPE results are compared with 
the results for several other computers bench- 
marked with the same filter. 


The results of the BMD benchmark studies 
have provided data for comparing the performance 
of PEPE, CDC 7600, CDC 7700, TI-ASC and CRAY-1. 


Curves are presented for the combined radar 
scheduling and object tracking function for the 
PEPE, CDC 7700, and the CRAY-1. These curves 
indicate that the PEPE can outperform the other 
computers for systems as small as 108 elements. 


Introduction 


The PEPE (Parallel Element Processing 
Ensemble) Development Program, for the past 4-1/2 
years, has been concerned primarily with the 
design and development of an experimental hardware 
and software facility for conducting research on 
parallel and associative data processing techni- 
ques as applied to Ballistic Missile Defense 
(BMD) service. The experimental facility includes 
a partial PEPE machine and the support software 
system needed for coding, evaluating, and demon- 
strating experimental tactical processes operating 
on the machine in a simulated BMD environment. 


Preliminary investigative work on PEPE 
applications involved an analysis of the 
utility of PEPE as an adjunct to the CDC 7700 
data processor. The study objective was to find 
out if it were possible to secure a moderate 
improvement in the data processing performance 
with no hardware modification other than the 
addition of PEPE to standard CDC 7700 interfaces, 
and with minimal change to existing software. 

The study successfully achieved its objective 
with a rather simple implementation of PEPE in a 
fast response "offloading" configuration, and led 
to another study to develop a more sophisticated, 
higher performance implementation, but still with 
the constraint that changes to existing software 
be minimal. Results of this study were demon- 
strated using functional simulations. This study 
successfully achieved its objectives in July 
1975. 


The above demonstration was followed by a 
study in which the minimum software breakage 
constraints were relaxed. The results of this 
study, which permitted a much higher performance 
implementation, were demonstrated by functional 
simulations. 


Concurrent with the functional simulations, 
an analytic benchmarking effort on the hardware 
was conducted. The benchmarks were primarily BMD 
routines to collect data on the hardware func- 
tions. The results of the BMD benchmark studies 
have provided data for comparing the performance 
of PEPE and the CDC 7600, CDC 7700, TI-ASC and 
CRAY-1. 


Benchmarks have also been run which permit 
a comparison of PEPE with the STAR-100, IBM 
370/195, AMDAHL 470/V6 and the TI-ASC 4 pipe. 


Curves have been derived for the combined 
radar scheduling and object tracking functions on 
the CDC 7700, PEPE and the CRAY-1. These data 
indicate that the PEPE can outperform the other 
systems with PEPE limited to 108 parallel ele- 
ments. 


Offload Studies and Results 


A study was made to identify the heavy 
resource users in a BMD system, and to investi- 
gate their behavior throughout a threat. The six 
highest resource users were identified as: 


Radar Interface Processing 

Object Tracking Processing 

Interceptor Control Processing 

Track Initiate Returns Processing 

Track Initiate Designation and Beam 
Pointing Processing 

Passive Object Discrimination Processing 


The three tasks with the most potential for 
implementation on PEPE were Radar Interface 
Processing, Object Tracking and Passive 
Discrimination. The main considerations in 
choosing these tasks were the CPU resources 
required by the tasks, commonality of data base, 
and inherent parallelism. The average CPU 
resource requirements for the three tasks were 
28.32%, 21.39% and 9.4%, respectively, for Radar 
Interface Processing, Object Tracking and 
Passive Discrimination. 
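
As a rough indication of how much host load these three tasks represent together, their average CPU shares can be summed. This is a small Python sketch using only the percentages quoted above; the dictionary keys are just labels.

```python
# Average CPU resource shares (percent) of the three candidate tasks.
cpu_share = {
    "Radar Interface Processing": 28.32,
    "Object Tracking": 21.39,
    "Passive Discrimination": 9.4,
}

# Combined share of host CPU time consumed by the three tasks.
combined_share = sum(cpu_share.values())
print(round(combined_share, 2))   # 59.11
```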


A design implementing these three functions 
on PEPE was generated and implemented in an 
existing functional simulator. The Object 
Tracking and Passive Discrimination tasks were 
implemented in PFOR [1] (Parallel FORTRAN) 
directly on the parallel elements controlled by 
the Arithmetic Control Unit (ACU); however, the 
Radar Interface Processing (radar scheduling) 
had to be redesigned for implementation on PEPE. 
The redesigned task consisted of two parts: 
(1) a radar returns sorting task (SORT) and 
(2) a radar pulse scheduler (SCHED). SORT was 
designed to be implemented on the Correlation 
Control Unit (CCU) while SCHED would operate on 
the Associative Output Control Unit (AOCU). 


A PEPE/7700 system was designed with the 
radar data processing subsystem interface 
connected to the PEPE as shown in Figure 1. The 
Radar Interface, Object Track task and the 
Passive Discrimination function were implemented 
on the PEPE as described above. The incoming 
data were sorted in the PEPE CCU and those data 
not needed in PEPE were transmitted directly to 
the 7700. 


Figure 1. PEPE/7700 Data Flows 
(Blocks: radar system, PEPE, CDC 7700. Flows: 
object track returns, non-object radar returns.) 


A simulation of the above system, employing 
the SDC PEPSIE executive, was implemented and 
driven by a simple threat generator. The results 
of the simulation runs indicate that the average 
7700 CPU loading was reduced by 40%, queue 
buildups were almost eliminated, 7700 CPU1 peak 
loading never exceeded 65% utilization, 7700 
CPU2 peak loading never exceeded 80% and polling 
loop delays were significantly reduced for all 
polling loops. Additional statistics obtained 
from the simulator are shown in the table below. 


(Table. Columns: Host Only, 7700-PEPE, 7600-PEPE. 
Rows: average CPU1 usage (%), average CPU2 usage 
(%), average CPU usage, and polling loop wait 
times (msec), including loops 6 and 7.) 


All of the data in the table are average 
percent utilizations and wait times in 
milliseconds. Data with respect to maximum, 
minimum and standard deviation for CPU usage are 
presented in the following table. 


(Table. Columns: PEPE/7700 CPU1, PEPE/7700 CPU2, 
7600, PEPE/7600. Rows include average and 
standard deviation.) 


Continuous curves of CPU1 and CPU2 loading 
are shown in Figures 2 and 3, respectively. The 
imbalance between CPU1 and CPU2 loading was 
caused by dedicated tasks left on CPU2. 


Figure 2. CPU1 Utilization Versus Time 
(Axes: CPU1 utilization (%) versus simulation 
time. Curves: 7700 and PEPE/7700.) 


Figure 3. CPU2 Utilization Versus Time 


Subsequent to the simulation study, the most 
important resource users implemented on PEPE have 
been benchmarked on the hardware. The results of 
these studies are presented in the next section. 


BMD Benchmarks 


The results of a PEPE benchmarking study for 
a Kalman Filter object tracking routine are 
presented in this section. The particular 
filter implementation employed in this study was 
a seven state fully coupled version developed by 
Teledyne Brown Engineering several years ago for 
a Computer Comparison study. The filter is 
referred to as the TKPRC subroutine throughout 
this paper. 


The TKPRC subroutine was run on the TI-ASC 
and compared with the CDC 7600. The code was 
then extensively revised to adapt it to the ASC 
as a result of this Comparison [2]. The recoded 
version for the ASC showed a very substantial 
reduction in execution time, which was attributed 
to the increase in code vectorization. The 
subroutine executed in virtually a pure vector 
mode at this point. 


The initial code as run on the ASC was then 
executed on the CDC 7600. The one exception was 
that the ASC FORTRAN extensions were replaced by 
standard FORTRAN statements for the CDC 7600. It 
is interesting to note that the execution time on 
the 7600 was approximately 20% less due to the 
vectorizing done for the ASC. 


The filter implemented on the CRAY-1 was the 
version run on the CDC 7600. Thus the subroutine 
TKPRC was initially optimized for the TI-ASC when 
run on the CDC 7600 and the CRAY-1. The filter 
initially implemented on PEPE was the original 
unoptimized version of TKPRC. It was subsequently 
optimized for matrix operations as described in 
the next section. 


Implementation On PEPE 


The original version of the TKPRC subroutine 
was translated directly from the FORTRAN code to 
PFOR. The time required to plan the conversion, 
prepare the coding sheets, and obtain the first 
debug run was 8 hours. 


When this version was run and timed on PEPE, 
the execution time was 6.98 milliseconds. An 
examination of the code revealed that the 
covariance matrix prediction process was 
extremely inefficient on PEPE. The covariance 
matrix prediction process was recoded in PFOR 
(not assembler) and new timing runs made. 
Subsequently, all the matrix operations were 
recoded. The recoding consisted of removing a 
total of 20 lines of FORTRAN code and adding 12 
lines of new code. The result of this change 
was to reduce the run time to 4.10 milliseconds. 


Results 


Table I contains the results of running and 
timing the subroutine TKPRC on the PEPE hardware. 
Several interesting results are contained in 
Table I. The most impressive result is the large 
reduction in the number of sequential instructions 
executed when the matrix multiply was optimized 
(reduced to 4920 from 20,964). This largely 
accounted for the reduction from 6.8588 to 4.1446 
milliseconds run time. Another interesting 
result is the 2.57 MIPS effective speed for the 
ACU. This result is due to the overlapping of 
the parallel and sequential instruction executions. 

Table I. Data for One Cycle of Filter 
for N Objects 

                                         MATRIX MULTIPLY 
                                         OPTIMIZED 
                        ORIGINAL CODE    (BEST SO FAR) 

NO. OF PARALLEL 
INSTRUCTIONS EXECUTED 

NO. OF SEQUENTIAL 
INSTRUCTIONS EXECUTED   20964            4920 

RUN TIME                6.8588 MILLISEC  4.1446 MILLISEC 

PARALLEL INST. TIME     4.2435 MILLISEC  3.9870 MILLISEC 

SEQUENTIAL INST. TIME   2.1066 MILLISEC  0.5040 MILLISEC 

MIPS                    3.98             2.57 
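
The surviving entries of Table I support a few derived quantities. The Python sketch below is an illustration using only the table values; the dictionary layout and the `overlap_ms` helper are hypothetical names, and reading "parallel time plus sequential time minus run time" as overlap is an interpretation, not something the paper defines.

```python
# Values taken from Table I (times in milliseconds).
original = {"seq_instr": 20964, "run": 6.8588,
            "parallel": 4.2435, "sequential": 2.1066}
optimized = {"seq_instr": 4920, "run": 4.1446,
             "parallel": 3.9870, "sequential": 0.5040}

# Run-time improvement from optimizing the matrix operations.
speedup = original["run"] / optimized["run"]

# Fractional reduction in sequential instructions executed.
seq_reduction = 1 - optimized["seq_instr"] / original["seq_instr"]

def overlap_ms(row):
    """Parallel + sequential time minus run time.

    Positive: some sequential work was hidden behind parallel work.
    Negative: the run time includes time attributed to neither stream.
    """
    return row["parallel"] + row["sequential"] - row["run"]

print(round(speedup, 2), round(seq_reduction, 3))
print(round(overlap_ms(original), 4), round(overlap_ms(optimized), 4))
```

On these figures the optimized version shows a positive overlap while the original does not, which is consistent with the text's remark about overlapped parallel and sequential execution.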


Figure 4 shows a comparison of the run times 
for the CDC 7600, TI-ASC and the PEPE. The data 
for the CDC 7600 and the TI-ASC were taken from 
[2]. Figure 4 shows a crossover at 7 targets for 
the PEPE versus the CDC 7600. This means that 
the PEPE is more efficient for object tracking 
alone when the number of targets exceeds 7. 

Stated another way, the full PEPE is 41 times as 
powerful as the CDC 7600 while tracking 288 
targets. When compared with the TI-ASC, the 
crossover is at 8 targets or 36 times more 
powerful for a full target load. 


Figure 4. TRACK Benchmark - Subroutine TKPRC, 
Serial/Vector Crossover 
(Axes: time in milliseconds versus N objects. 
Curves include vector (ASC) and PEPE.) 


The results for the TKPRC benchmark on the 
CRAY-1 are for the CDC 7600 optimized version of 
the filter [3]. That is, the filter on the 7600 
would be approximately 15% better than the 
results from [2]. The filter was hand coded in 
assembler for the CRAY-1 since a FORTRAN Compiler 
did not exist at that time. Employing the 
results from [3] of 2.17 x 10^-3 seconds for 64 
objects, the PEPE is 2.4 times more powerful than 
the CRAY-1 for object track. The crossover is at 
121 targets. These results are considered to be 
an optimistic upper bound for the CRAY-1 due to 
the fact that a Fortran compiler was not used in 
the tests. 
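
The crossover points and "times more powerful" figures quoted above appear to be related in a simple way. The Python sketch below shows one reading that reproduces them; it is an assumption, not a method stated in the paper. It takes the crossover as the PEPE run time divided by the serial machine's time per object, and the power ratio as the full 288 element load divided by the crossover. The CRAY-1 time for 64 objects is taken as 2.17 ms, which agrees with the linear fit y = .2939 + .02923 N given later in the paper.

```python
PEPE_RUN_MS = 4.1            # PEPE TKPRC run time, independent of N
FULL_PEPE_ELEMENTS = 288     # objects tracked by a full PEPE

# CRAY-1: 2.17 ms for 64 objects (assumed exponent, see lead-in).
cray_ms_per_object = 2.17 / 64
cray_crossover = PEPE_RUN_MS / cray_ms_per_object

def power_ratio(crossover_objects):
    """Full PEPE load divided by the crossover point (assumed reading)."""
    return FULL_PEPE_ELEMENTS / crossover_objects

print(round(cray_crossover))          # 121 targets
print(round(power_ratio(121), 1))     # 2.4 (CRAY-1)
print(round(power_ratio(7)))          # 41  (CDC 7600, crossover at 7)
print(round(power_ratio(8)))          # 36  (TI-ASC, crossover at 8)
```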


Multiple Task Performance 


The preceding results were derived for a 
single function, object tracking, operating on 
each computer. The true performance of a 
computer can only be assessed when multiple 
tasks are present. In this section the preceding 
results are extrapolated to the combined task of 
object tracking and radar scheduling. Since the 
TKPRC track filter benchmark has been run on all 
the computers discussed in this paper, run time 
equations can be derived. The method for 
deriving timing estimates for the combined 
functions is discussed below. 


Figure 5 contains the plotted data for TKPRC 
on the CRAY-1 computer. The data have been 
reduced to run times for one iteration versus N 
objects. The best fit to the data appeared to be 
a straight line of the form 

y = mx + b. 

The results for the data in Figure 5 are 


y(CRAY TKPRC) = .2939 + .02923 N 


where y is the run time in milliseconds. The 
data in Figure 4 can also be fitted to derive a 
similar curve for the CDC 7600 [4]. The results 
were 

y(7600 TKPRC) = .5 + .5 N milliseconds. 


Run time estimates for the RADAR SCHEDULER 
function (operating on the 7600) were derived 
from the PEPE applications simulator results. 
The time required to consider M objects for 
scheduling by the Radar Interface Processing 
function (in milliseconds) was found to be 


y(7600 SCHED) = .35889 + .1298 M. (a) 


Run times for Radar Scheduler operating on 
the CRAY-1 were derived in the following manner. 
Assume that Radar Scheduler would be implemented 
in the sequential unit of the CRAY-1 since it 
does not appear to lend itself to vectorizing. 
The cycle time of the CRAY-1 is 12 nanoseconds 
which is 2.2 times as fast as the 27.5 nano- 
seconds for the 7600. Using these assumptions, 
the equation for the CRAY-1 Radar Scheduler run 
time in milliseconds is 


y (CRAY SCHED) = .163 + .059 M. 


The above equations do not contain any 
allowances for differences in the number of 
machine cycles required to execute an instruc- 
tion, overhead, interrupts, etc. Since the 
sequential and vector operations are mutually 
exclusive in the CRAY-1, the combined run time 
for Object Tracking and Radar Scheduling is given 
by 


y(SCHED + TKPRC) = .02923 N+ .059 M+ .457. 
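
The CRAY-1 equations above can be re-derived from the 7600 scheduler fit and the quoted speed ratio. The Python sketch below does this with the coefficients given in the text; the function names are illustrative only.

```python
SPEED_RATIO = 2.2   # the text's figure for the 27.5 ns / 12 ns cycle ratio

def sched_7600_ms(m):
    """Radar Scheduler run time on the CDC 7600 for M objects."""
    return 0.35889 + 0.1298 * m

def sched_cray_ms(m):
    """CRAY-1 scheduler estimate: 7600 time scaled by the cycle ratio."""
    return sched_7600_ms(m) / SPEED_RATIO

def tkprc_cray_ms(n):
    """Linear fit to the CRAY-1 TKPRC benchmark for N objects."""
    return 0.2939 + 0.02923 * n

def combined_cray_ms(n, m):
    """Sequential and vector work do not overlap, so the times add."""
    return tkprc_cray_ms(n) + sched_cray_ms(m)

print(round(0.35889 / SPEED_RATIO, 3))    # 0.163
print(round(0.1298 / SPEED_RATIO, 3))     # 0.059
print(round(combined_cray_ms(0, 0), 3))   # 0.457
```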


The PEPE run time for the TKPRC track 
filter is a constant of 4.1 milliseconds for up 
to 288 objects or the maximum number of elements 
in the system. Therefore, the PEPE TKPRC equa- 
tion, in milliseconds, is 


y(PEPE TKPRC) = 4.1. 


The PEPE run time for the Radar Scheduler 
function taken from the Simulator mentioned above 
is given by 


y(PEPE SCHED) = .4 + .0432 M. 


(a) Subsequent studies have shown that the 
coefficient of M may be as large as .25. 
Figure 5. TKPRC Benchmark 
(Linear fit: T = .2939 + .02923 N) 


The above equations have been plotted on 
Figure 6 as a function of the number of objects 
in track plus the number of instances inputted 
for radar scheduling [5]. The results indicate 
that a PEPE can outperform the other systems for 
systems as small as 3 cabinets (or 108-elements) 
of parallel elements. Larger systems permit 
large reserves for growth. 
Figure 6. Combined TKPRC and TRIP Run Times 


Matrix Factorization 


A matrix factorization benchmark, employed 
in a benchmarking effort at the Systems 
Engineering Laboratory, University of Michigan 
[6], was programmed and run on the PEPE. This 
benchmark is an example of the application to 
the PEPE of a problem for which the machine was 
not designed. The matrix factorization process 
requires the transfer of data between the 
elements, which is an essentially serial process 
in the present design. Figure 7 shows the PEPE 
results superimposed on the University of 
Michigan results. It is evident from these data 
that PEPE is competitive on this problem for 
very large matrices. The fitted equation for 
these data is 


T(nanoseconds) = 6544.28 + 5705.0 N + 4417.86 N^2 
Summary and Conclusions 


It has been demonstrated through simulations 
and benchmarking that the PEPE can be introduced, 
as designed, into existing BMD systems and 
assume approximately one-half of the CPU load. 

If PEPE were designed into the system at the 
beginning, it is estimated that it could assume 

up to 75% of the CPU loading. The PEPE system 
provides for considerable growth by adding cabinets 
of parallel elements up to 288. 


Benchmarking efforts are continuing to 
demonstrate the application to other areas of the 
BMD problem, such as the processing of optical 
data. 


Note 


The data and conclusions presented in this 
paper are the results of a preliminary evaluation 
effort. These conclusions do not represent an 
official BMDSCOM position since further detailed 
studies are now underway utilizing a different 
design configuration. 
A PARALLEL PROCESSOR APPROACH FOR SEARCHING DECISION TREES 


Duane David Marshall 
Huntsville, Alabama 35805 


Abstract -- The use of a decision tree to 
represent a decision making problem is well- 
known. Current methods for examining the entire 
decision tree are too time consuming. One way 
to overcome this difficulty is to use a parallel 
processor computer. A brief description of the 
capabilities of a parallel associative processor 
is given and performance results for a tree 
search algorithm are included. Results indicate 
that the parallel approach examines the tree 
much faster than the previously used sequential 
algorithms. 


Introduction 


Many decision processes may be expressed as 
a decision tree as shown: 


This tree is composed of a set of nodes and 
branches where each node represents the selection 
of an alternative and the branch leading from a 
node represents the decision made. Here, each 
decision is assumed to have only two possible 
alternatives. Since a tree with N decisions has 
a total of 2^N different combinations, a problem 
with more than 50 decisions is impractical to 
approach directly in this format. 
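
To give a sense of scale, a tree of N two-way decisions has 2^N complete combinations. The short Python sketch below (the function name is illustrative) tabulates a few sizes.

```python
def combinations(decisions):
    """Number of complete assignments for N two-way decisions."""
    return 2 ** decisions

for n in (10, 20, 50):
    print(n, combinations(n))
# 10 1024
# 20 1048576
# 50 1125899906842624
```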


In general terms, a problem has some objective 
or goal which the decision-maker sets. His 
difficulty lies in the fact that he is attempting 
to reach his goal while not breaking a set of 
predetermined restrictions. This problem is 
usually described as 


Maximize       F(X) 
Subject to:    G(X) < B 


where the problem is specified in terms of the 
decisions X. 


A sequential algorithm for the solution of 
this problem was developed by Balas in 1965 [1] 
and uses the decision tree approach. Essentially, 
the algorithm searches the entire solution tree 
one node at a time to identify those decisions 
which allow the decision-maker to maximize his 
objective while remaining within the restrictions. 
As the algorithm progresses in its search, 
information collected about the problem is used 
to discard portions of the solution tree before 
they are examined in detail. Hence the name 
"implicit" enumeration. Studies have shown that 
implicit enumeration is practical only for 
problems with less than 100 to 150 decisions. 
Even for small problems with 50 variables or 
less, the solution time for the sequential 
algorithm may be several minutes. 


The Problem 


An example of a decision tree structure is 
to be found in the resource allocation problem 
of assigning interceptors to targets or 
assigning returns to known targets. Leal [4] has 
described the application of a decision tree 
structure to the problem of interceptor 
allocation. He proposes 
that the decision tree formulation be used in 
combination with artificial intelligence techni- 
ques to "teach" a program to respond quickly 
during a BMD attack. This prior "learning" by 
the defense system monitor would allow many poor 
alternatives to be discarded and enable the 
system to perform at a high level of effective- 
ness. In order to develop this artificial 
intelligence program, the learning program must 
investigate the entire solution tree many times. 
However, since the current method used to search 
the decision tree is a sequential search over 
the tree where only one decision is examined at 
a time, solution times for a problem with many 
alternatives are excessive. Thus, any model of 
sufficient complexity to be useful will have a 
decision tree too large to be processed in a 
reasonable amount of time using a sequential 
computer. 


A Parallel Processing Solution 


Morefield [6,7,8] has proposed that one 
potential means to overcome this problem lies in 
the use of parallel processors to search the 
decision tree. In order to study the tree 
search methods using a parallel associative pro- 
cessor, an algorithm has been developed which 
examines many nodes of the solution tree simulta- 
neously [5]. This algorithm is based on the 
sequential search method which has been studied 
extensively in the literature [1,2,3]. Essen- 
tially, the algorithm uses a modification of the 
sequential algorithm to generate many candidate 
nodes simultaneously. Each candidate is assigned 
to a processing element. All processors perform 
exactly the same arithmetic operations in 
examining their own candidates. Once a "good" 
solution is found, that information is shared 
among all the processing elements to efficiently 
search the tree. In general, the solution tree 
tends to grow exponentially as the algorithm 
progresses. A mechanism is included in the 
algorithm to recognize a limited number of 
processing elements and to function within 
that limitation. 
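
The idea described above can be illustrated with a small simulation. The Python sketch below is not the algorithm of [5]; it is a simplified stand-in, assuming a single restriction of the form sum(weights * x) <= capacity, non-negative values, weights and capacity, and a hypothetical function name `batched_tree_search`. Each iteration hands at most `num_pes` candidate nodes to the processing elements, every candidate is examined with the same operations, and the best solution found so far is shared so that hopeless branches are discarded.

```python
from itertools import product

def batched_tree_search(values, weights, capacity, num_pes):
    """Search a 0-1 decision tree, examining up to num_pes nodes per
    iteration. Returns (best objective value, number of iterations)."""
    if num_pes < 1:
        raise ValueError("num_pes must be at least 1")
    n = len(values)
    if n == 0:
        return 0, 0
    # remaining[d] = largest value still obtainable from decisions d..n-1.
    remaining = [0] * (n + 1)
    for d in range(n - 1, -1, -1):
        remaining[d] = remaining[d + 1] + values[d]

    frontier = [(0, 0, 0)]   # (next decision, value so far, weight so far)
    best = 0                 # incumbent "good" solution shared by all PEs
    iterations = 0
    while frontier:
        iterations += 1
        batch = frontier[-num_pes:]      # one candidate node per PE
        del frontier[-num_pes:]
        shared = best                    # incumbent known at iteration start
        for depth, value, weight in batch:   # conceptually simultaneous
            for take in (1, 0):
                v = value + take * values[depth]
                w = weight + take * weights[depth]
                if w > capacity:
                    continue             # violates the restriction
                best = max(best, v)
                # Keep the child only if it could still beat the incumbent.
                if depth + 1 < n and v + remaining[depth + 1] > shared:
                    frontier.append((depth + 1, v, w))
    return best, iterations

def brute_force(values, weights, capacity):
    """Explicit enumeration of all 2^N combinations, for checking."""
    best = 0
    for x in product((0, 1), repeat=len(values)):
        if sum(w * xi for w, xi in zip(weights, x)) <= capacity:
            best = max(best, sum(v * xi for v, xi in zip(values, x)))
    return best

# Made-up test data, not taken from the paper.
values = [10, 7, 5, 4, 3, 1]
weights = [6, 5, 4, 3, 2, 1]
for pes in (1, 2, 5, 10):
    print(pes, batched_tree_search(values, weights, 10, pes))
```

The iteration count returned for each number of processing elements is the kind of quantity the comparison criterion in the following paragraph refers to.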


In order to study the parallel algorithm, a 
program has been written which simulates the 
gross functional characteristics of a parallel 
associative processor using the parallel algo- 
rithm. The basic criterion used for comparison 
of the sequential and parallel algorithms is the 
number of algorithm (sequential or parallel) 
iterations required to find and verify the 
existence of an optimal solution. Preliminary 
studies indicate that a parallel associative 
processor can solve these decision problems in a 
fraction of the time required by a sequential 
processor. Representative results with the 
parallel algorithm on a test problem of 
Petersen's [9] are given in Figure 1.
Results


Table 1 shows the experimental estimates
obtained for the solution of several test
problems.


Table 1. Experimental Results

Columns: Problem; Number of Decisions; Sequential
Algorithm; Parallel Algorithm (Number of PEs: 2, 5, 10)
Surviving entries (column assignment unclear):
6, .003, .002, .002, .001, .001

Estimated solution times were obtained by
multiplying the sequential algorithm solution
time by the ratio of the number of parallel
algorithm iterations to the number of sequential
algorithm iterations. Times were obtained using
a CDC 7600 computer.
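
The estimate described above can be written as a one-line computation. The sketch below is only an illustration: the function name and the sample numbers are hypothetical and are not taken from Table 1.

```python
def estimated_parallel_time(sequential_time, parallel_iterations,
                            sequential_iterations):
    """Scale the measured sequential time by the iteration ratio."""
    return sequential_time * parallel_iterations / sequential_iterations


# Hypothetical figures: 0.003 time units sequentially, 120 sequential
# iterations, 40 parallel iterations.
estimate = estimated_parallel_time(0.003, 40, 120)
```

With these made-up inputs the estimate is one third of the sequential time, because the parallel algorithm needed one third as many iterations.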


In general, the parallel solution method
becomes more attractive as the number of pro-
cessing elements increases. However, tests show
that each problem has a certain limit beyond 
which the addition of processing elements has 
little effect on the algorithm performance. If 
the sequential algorithm required M iterations, 

a parallel associative processor with N pro- 
cessing elements would solve the problem in less 
than M/N iterations. In fact, as the number of 
decisions increases, the solution rate becomes 
much less than the M/N ratio. The reason for 
this lies in the way in which the decision tree 
is built as shown in Figure 2. The sequential 
algorithm essentially builds a "tall" tree in 
that one branch is examined in depth, whereas 

the parallel algorithm builds a "wide" tree in 
that many branches are examined simultaneously. 
The process of building a "wide" tree enables 
the parallel algorithm to discard "unfavorable" 
alternatives faster than the sequential algorithm. 
The performance of the sequential algorithm 
would be competitive with the parallel algorithm 
only if a "good" solution is found in the 
extreme upper left side of the decision tree,
since the sequential algorithm builds the decision
tree top to bottom and left to right, whereas the
parallel algorithm builds the tree top to bottom.


Figure 2. Comparison of Tree Construction
After Only Three Algorithm Iterations
(left: Sequential Algorithm; right: Parallel
Algorithm)


Machine Architecture 


The architecture of a parallel processor to 
use the parallel algorithm is extremely simple. 
Essentially, the machine would consist of a 
control unit and an ensemble of processing 
elements as shown in Figure 3. The control unit 
would require a small memory and would need to 
block transfer a large number of words to all 
elements simultaneously. The processing elements 
would require a parallel indexing capability in 
order to implement the parallel algorithm effi-
ciently. Element memory would be on the order
of one thousand (decimal) words. The ensemble
should contain at least as many processing 
elements as the number of decisions in the 
original problem. 


Figure 3. Parallel Processor Architecture
(a control unit and an ensemble of processing
elements)


Additional Questions 


Additional research is required to describe 
the best way to use the parallel algorithm. 
Since each processing element is working its own 
independent problem, is there some way to share 
information to make the search more efficient? 
Can some internal measure of how efficiently the
parallel algorithm is working be developed? When 
a "good" solution is found by one processor, how 
may the other processors make the most efficient 
use of this information? These are only some of 
many additional questions which remain to be 
considered concerning the parallel algorithm. 


Conclusions


A parallel algorithm for use on a parallel
associative processor has been developed to
search through a decision tree. Preliminary
experiments indicate that the solution times
using this method are much better than those
of the sequential method. This result enables
large decision problems to be solved exactly
rather than relying on some search heuristic
which may or may not lead to an acceptable
solution.
PARALLELISM IN SORTING


Franco P. Preparata 
Abstract


In this paper we describe a family of paral-
lel sorting algorithms for a multiprocessor sys-
tem. These algorithms are enumeration sorts, i.e.,
they are based on subdividing the keys into sub-
sets and determining for each key the number of
smaller keys (count) in every subset. The novelty
is that parallel merging is used to implement the
acquisition of the counts. By using Valiant's
merging scheme, n keys can be sorted in parallel
using n log_2 n processors in time C log_2 n; if memory
fetch conflicts are not allowed, then for
0 < α ≤ 1 sorting on n^(1+α) processors runs in time
(C'/α) log_2 n + o(log_2 n).


1. Introduction 


The efficient implementation of comparison 
problems, such as merging, sorting, and selection, 
by means of multiprocessor computing systems has 
attracted considerable attention in recent years. 
One of the earliest fundamental results is due to 
K. E. Batcher [1], who proposed a sorting network 
consisting of comparators and based on the prin-
ciple of iterated merging; as is well-known, such
a scheme sorts n keys with O(n (log n)^2) comparators
in time O((log n)^2). Batcher's network is readily
interpreted, in a more general framework, as a
system of n/2 processors with access to a common
data memory of n cells: obviously, the network
structure induces a nonadaptive schedule of memory
accesses. After the appearance of Batcher's paper,
substantial work was aimed at filling the gap be-
tween the upper bound O((log n)^2) on the number of
steps which is achievable by a network of compara-
tors and the lower bound O(log n); the lack of
success, however, convinced several workers to 
look for more flexible forms of parallelism. 


The first scheme shown to sort n keys in 
time O(logn) is due to D. E. Muller and F. P. 
Preparata [2], but it requires a discouraging
number of O(n^2) processors. Subsequently, new re-
sults were obtained on parallel merging by F.
Gavril [3]. L. G. Valiant [4] must be credited
with addressing the fundamental question of the
intrinsic parallelism of some comparison prob- 
lems and with the development of faster algo- 
rithms than were previously known. In particular, 
in [4] he described an algorithm for merging with
sqrt(mn) processors two sorted sequences of n and m
keys, respectively (n ≤ m), in 2 log log n + O(1)
comparison steps; this algorithm can then be ap-
plied to sort n keys with n processors in
2 log n log log n + O(log n) steps. His method assumes
a computational model in which there is no pen-
alty for memory-processor alignment and the over-
head, corresponding to the reassignment of sets of
processors to subsequences to be merged, is
ignored. 


A new family of sorting algorithms has been 
recently discovered by D. Hirschberg [5]. Assum- 
ing as a computation model a parallel processing 
system of the SIMD type (single-instruction 
stream, multiple-data stream) with random access 
capabilities to a common memory, Hirschberg shows 
that n keys can be sorted in time O(k log n) with
n^(1+1/k) processors, where k is an arbitrary in-
teger ≥ 2. These schemes are not free of memory
fetch conflicts (simultaneous reading of the same 
location by more than one processor) and Hirsch- 
berg poses as an open question the possibility of 
achieving analogous performances without memory 
fetch conflicts. 


In this paper we shall present two results. 
The first, discussed in Section 2, is an algo- 
rithm for sorting n keys in time C logn (where C 
is a constant) with nlogn processors: this algo- 
rithm combines a number of known techniques, and 
makes crucial use of Valiant's merging algorithm. 
The second result (Section 3) is a family of very 
simple sorting algorithms, which have the same
running time as Hirschberg's, but use basically 
different techniques and are entirely free of 
memory fetch conflicts. As our computation model 
we adopt a system of several identical processors, 
each capable of random-accessing a common memory 
with no alignment penalty. Store, fetch, and 
arithmetic operations have unit costs, and fetch 
conflicts are disallowed when appropriate. 


(1) Throughout this paper "log" means
"logarithm to the base 2".


All of the algorithms described in this
paper - as well as Hirschberg's [5] - are in-
stances of enumeration sorting, in Knuth's termi-
nology ([6], p. 73). In these methods each key
is compared with all the others and the number of
smaller keys determines the given key's final
position. Specifically, three distinct tasks are
clearly identifiable in enumeration sorting 
algorithms: 

(i) count acquisition. The set of keys is 
partitioned into subsets and for each 
key we determine the number of smaller 
keys in each subset (this informal de-
scription momentarily assumes that all
keys are distinct);

(ii) rank computation. For each key the sum 
of the counts obtained in (i) gives the 
final position (rank) of that key in the 
sorted sequence; 

(iii) data rearrangement. Each key is placed 
in its final position according to its 
rank. 

Less informally, an enumeration sorting scheme 
has the following format, where we assume for 
simplicity that, for some given integer r, n=kr. 
Data structures to be used are arrays of keys. 
By A[i:j] we denote a sequence A[i]A[i+1]...A[j].


Input:  A[0:n-1], the array of the keys to
        be sorted, integer r
Output: A[0:n-1], the array of the sorted
        keys.
begin
1. Define A_i[0:r-1] = A[ir:(i+1)r-1],
   for i=0,...,k-1.
2. c_ℓ^(ij) = |{A_j[h] | A_j[h] ≤ A_i[ℓ]}|   for j < i
   c_ℓ^(ij) = |{A_j[h] | A_j[h] < A_i[ℓ]}|   for j > i
   c_ℓ^(ii) = |{A_i[h] | A_i[h] ≤ A_i[ℓ], h < ℓ}|
            + |{A_i[h] | A_i[h] < A_i[ℓ], h > ℓ}|
3. rank(A_i[ℓ]) = sum over j=0,...,k-1 of c_ℓ^(ij)
4. A[rank(A_i[ℓ])] ← A_i[ℓ]
end


Note that count acquisition, rank computation, and 
data rearrangement are performed, respectively, in 
steps 2, 3, and 4. Also, the algorithm must in- 
sure that all ranks be distinct, which is a cru- 
cial condition for the data rearrangement task 
(otherwise memory store conflicts would occur). 
This clearly poses no problem when the keys are 
all distinct. In the opposite case, some conven- 
tion must be adopted for the ordering of sets of 
identical keys. One such convention is that sort- 
ing be stable (see [6], p. 4), that is, the initial 
order of identical keys is preserved in the sorted 
array. Thus, all of our sorting schemes will be 
stable. This is reflected in the rules for the
computation of the parameters c_ℓ^(ij) in Step 2 of
the above algorithm.
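
The three tasks of the scheme (count acquisition, rank computation, data rearrangement) can be traced in a purely sequential sketch. The code below is an illustration under assumptions, not the parallel implementation: the function name and the `key` parameter are hypothetical, and the nested loops stand in for what the paper assigns to separate processors. Keys that come earlier in (subset, position) order are counted with "no greater than", later ones with "less than", which is the tie rule that keeps the sort stable.

```python
def enumeration_sort(records, r, key=lambda rec: rec):
    """Stable enumeration sort of n = k*r records (sequential sketch)."""
    n = len(records)
    assert n % r == 0, "the informal scheme assumes n = k*r"
    k = n // r
    blocks = [records[i * r:(i + 1) * r] for i in range(k)]
    out = [None] * n
    for i in range(k):
        for l in range(r):
            x = key(blocks[i][l])
            rank = 0
            for j in range(k):
                for h in range(r):
                    y = key(blocks[j][h])
                    if (j, h) < (i, l):        # earlier subset or position
                        rank += y <= x
                    elif (j, h) > (i, l):      # later subset or position
                        rank += y < x
            out[rank] = blocks[i][l]           # data rearrangement
    return out


recs = [(2, "a"), (1, "b"), (2, "c"), (1, "d")]
result = enumeration_sort(recs, r=2, key=lambda rec: rec[0])
```

Because equal keys keep their original relative order, every record receives a distinct rank and no two records are written to the same cell.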


The simple algorithm proposed by Muller and 
Preparata in [2] is a crude example of enumeration 


sorting, in which the sets A_i are chosen to be
singletons. With this choice, each key is com-
pared with every other key, thereby using O(n^2)
processors; similarly, rank computation uses
O(n^2) processors, since O(n) processors are
assigned to each key. The time bound O(logn) is 
due to Step 3 (counting in parallel the number of 
1's in a set of n binary digits), whereas Steps 2 
and 4 run in constant time in our present model. 


In the more complex procedures to be later 

described, the operations of rank computation 

and data rearrangement are essentially carried 

out as in the basic scheme described above. The 
main difference occurs with regard to count acqui- 
sition. In the Muller-Preparata method the counts 
are acquired by comparing each key with every 
other. The comparison of two keys A[i] and A[j]
could be viewed as merging A[i] and A[j]. Suppose
now that, rather than dealing with single keys, we
deal with sorted sequences of keys A_i[0:r-1] and
A_j[0:r-1], where r > 1 and, say, j < i. We easily
realize that the number of keys in A_j[0:r-1]
which are no greater than A_i[ℓ] (ℓ=0,...,r-1), as
well as the number of keys in A_i[0:r-1] which are
less than A_j[h] (h=0,...,r-1), can be obtained by
merging the two sequences A_j[0:r-1] and A_i[0:r-1].


In fact, let B[0:2r-1] be the array obtained by merg-
ing the two sorted arrays A_j[0:r-1] and A_i[0:r-1],
with the ordering convention A_k[s] ≤ A_k[s+1]
(k=i,j) and B[s] ≤ B[s+1]. Suppose also that the
merging be stable, that is, the order of identical
keys in the concatenated array A_j[0:r-1]A_i[0:r-1]
is preserved in B[0:2r-1]. If B[q] = A_i[ℓ], then
there are (q-ℓ) entries of A_j[0:r-1] in B[0:q-1]
which are no greater than A_i[ℓ]; similarly if
B[q] = A_j[h], then there are (q-h) entries of
A_i[0:r-1] in B[0:q-1] which are strictly less than
A_j[h]. This is the central idea of the algorithms
to be described.
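
A small sequential sketch may make the counting argument concrete. It is an illustration only: the function name is hypothetical, and an ordinary two-pointer merge stands in for the parallel merging algorithms discussed later. Ties are resolved in favour of the first sequence (the one with the smaller subset index), which is what makes the merge stable.

```python
def counts_by_merging(a_j, a_i):
    """a_j and a_i are sorted; a_j has the smaller subset index (j < i).

    Returns (c_i, c_j) where
      c_i[l] = number of keys of a_j that are no greater than a_i[l],
      c_j[h] = number of keys of a_i that are strictly less than a_j[h].
    """
    r_j, r_i = len(a_j), len(a_i)
    c_i, c_j = [0] * r_i, [0] * r_j
    h = l = 0
    for q in range(r_j + r_i):            # q is the position in B
        take_j = l == r_i or (h < r_j and a_j[h] <= a_i[l])
        if take_j:
            c_j[h] = q - h                # B[q] = A_j[h]
            h += 1
        else:
            c_i[l] = q - l                # B[q] = A_i[l]
            l += 1
    return c_i, c_j


c_i, c_j = counts_by_merging([1, 3, 3], [2, 3, 4])
```

Here the merged order is 1, 2, 3, 3, 3, 4, with the two 3's of the first sequence placed before the 3 of the second.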


2. A Fast Parallel Sorting Algorithm 


In this section we assume that in our compu- 
tational model memory fetch conflicts are permit- 
ted. To provide the feature required by Valiant's 
merging algorithm, that a key be simultaneously 
compared with several other keys, we may assume 
that the processors have broadcast capabilities. 
The only overhead we shall neglect is the re- 
assignment of processors to the operation of 
merging pairs of subsequences, as occurs in 
Valiant's method [4]. Notice that this model of 
parallel computation coincides with that required 
by Valiant's merging algorithm. 


We assume inductively that the following 
algorithm, SORT1, for p < n requires at most
⌊p log p⌋ processors to sort p keys. Since SORT1
is recursive, the following presentation consti-
tutes a constructive extension of the inductive
step to the integer n. The induction can be
started with n ≥ 4.


Algorithm SORT1


begin
1. k ← ⌈log n⌉, r ← ⌊n/⌈log n⌉⌋

2. Define arrays S[0:k;0:k;0:2r-1] and R[0:k;
   0:k;0:r-1] (three-dimensional arrays)
   and A_i[0:r-1] ← A[ir:(i+1)r-1] (i=0,...,
   k-1), A_k[0:n-kr-1] ← A[kr:n-1].
   Comment: When n=kr, array A_k is obviously
   vacuous. Array S is defined for simpli-
   city as having 2r(k+1)^2 cells, although
   the algorithm will only make use of the
   cells S[i;j;ℓ] for which i < j.

3. A_i[0:r-1] ← SORT1(A_i[0:r-1]) (i=0,...,k-1)
   A_k[0:n-kr-1] ← SORT1(A_k[0:n-kr-1]).
   Comment: This step is a parallel recur-
   sive call of SORT1 and it involves sort-
   ing in parallel k sets of r keys each and,
   possibly, one set of (n-kr) keys. By the
   inductive hypothesis it uses at most
   k⌊r log r⌋ + ⌊(n-kr) log(n-kr)⌋ processors.
   Since n-kr < ⌈log n⌉, the number of proces-
   sors used is less than
     ⌈log n⌉ ⌊⌊n/⌈log n⌉⌋ log⌊n/⌈log n⌉⌋⌋ + ⌊⌈log n⌉ log⌈log n⌉⌋
       ≤ n log(n/⌈log n⌉) + ⌈log n⌉ log⌈log n⌉
       = n log n - log⌈log n⌉ (n - ⌈log n⌉) < n log n - 1
       < ⌊n log n⌋, for n ≥ 3.
   For the sake of uniformity, array A_k is
   now extended to size r, where each cell of
   A_k[n-kr:r-1] is filled with a dummy
   sentinel larger than any key.

4. S[i;j;0:r-1] ← A_i[0:r-1] (i=0,...,k-1;
   j=i+1,...,k)
   S[i;j;r:2r-1] ← A_j[0:r-1] (i=0,...,j-1;
   j=1,...,k)
   Comment: This is a copying operation
   whose objective is to obtain
   S[i;j;0:2r-1] = A_i[0:r-1]A_j[0:r-1] for all
   pairs (i,j) with i < j. In our model,
   this operation could be done with maximal
   parallelism. However, using only
   (k+1 choose 2) r processors, the
   2r (k+1 choose 2) elementary copying
   operations are completed in two time
   units. For later convenience we assume
   that the record associated with key A_i[ℓ]
   contains a LABEL consisting of the pair
   of integers (i,ℓ).

5. S[i;j;0:2r-1] ← MERGE(S[i;j;0:r-1],
   S[i;j;r:2r-1]) (i=0,...,k-1; j=i+1,...,k)
   Comment: This step uses Valiant's merging
   algorithm and runs in time C_1 log log r, for
   some constant C_1, using (k+1 choose 2) r
   processors. The original version of
   Valiant's merging algorithm can be readily
   modified, so that, whenever two keys are
   identical the indices of their respective
   subarrays are compared.

6. Let (x,ℓ) ← LABEL S[i;j;q]
   If x=i then R[i;j;ℓ] ← q-ℓ else
   R[j;i;ℓ] ← q-ℓ
   (i=0,...,k-1; j=i+1,...,k; q=0,...,2r-1)

7. R[i;i;ℓ] ← ℓ (i=0,...,k; ℓ=0,...,r-1)
   Comment: Steps 6 and 7 complete the count
   acquisition task. In fact after Step 7
   the content of R[i;j;ℓ] is c_ℓ^(ij), in the
   terminology of Section 1. Step 6 can be
   executed in two time units using
   (k+1 choose 2) r processors, whereas Step 7
   uses (k+1)r processors and runs in one
   time unit.

8. rank(A_i[ℓ]) ← sum over j=0,...,k of R[i;j;ℓ]
   (i=0,...,k; ℓ=0,...,r-1)
   Comment: This step implements the rank
   computation. For any pair (i,ℓ) the sum
   can be computed with ⌊(k+1)/2⌋ processors
   in time ⌈log(k+1)⌉ ≈ log log n. The total
   number of processors used is therefore
   n⌊(k+1)/2⌋.

9. A[rank(A_i[ℓ])] ← A_i[ℓ] (i=0,...,k;
   ℓ=0,...,r-1)
end
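
The data flow of SORT1 can be followed in a sequential simulation. The sketch below is an assumption-laden illustration rather than the parallel algorithm: Python's stable `sorted` stands in for Valiant's merge in Step 5, the last block is left short instead of being padded with sentinels, and the name `sort1` and the base case for fewer than four keys are choices made here. The time and processor bounds of the paper do not apply to it.

```python
import math


def sort1(keys):
    """Sequential simulation of the block / merge / rank structure of SORT1."""
    n = len(keys)
    if n < 4:                                   # base case chosen for the sketch
        return sorted(keys)
    k = math.ceil(math.log2(n))
    r = n // k
    # Steps 2-3: k blocks of r keys plus a remainder block, each sorted.
    blocks = [sort1(keys[i * r:(i + 1) * r]) for i in range(k)]
    blocks.append(sort1(keys[k * r:]))
    # Step 7: a key's count within its own sorted block is its position.
    rank = [list(range(len(b))) for b in blocks]
    # Steps 4-6: merge every pair of blocks and read counts off positions.
    for i in range(len(blocks)):
        for j in range(i + 1, len(blocks)):
            labelled = ([(x, i, l) for l, x in enumerate(blocks[i])] +
                        [(x, j, l) for l, x in enumerate(blocks[j])])
            merged = sorted(labelled, key=lambda t: t[0])   # stable merge
            for q, (_, b_idx, l) in enumerate(merged):
                rank[b_idx][l] += q - l         # Step 8 accumulates the sum
    # Step 9: data rearrangement.
    out = [None] * n
    for i, block in enumerate(blocks):
        for l, x in enumerate(block):
            out[rank[i][l]] = x
    return out


data = [5, 3, 8, 1, 9, 2, 8, 3, 7, 0, 5]
result = sort1(data)
```

Ties between equal keys are broken by block index and then by position, so all ranks are distinct and the final writes never collide.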
To complete the analysis of the algorithm,
we observe that none of Steps 4-7 uses more than
(k+1 choose 2) r processors, but

(k+1 choose 2) r = ((k+1)/2) kr ≤ n(⌈log n⌉+1)/2.

Also, Step 8 uses n⌊(k+1)/2⌋ ≤ n(⌈log n⌉+1)/2
processors. Since for all n ≥ 4
n(⌈log n⌉+1)/2 < ⌊n log n⌋, the
inductive hypothesis on the number of processors
is extended.


Finally, let T(n) denote the running time of
the algorithm for n keys. Since r ≈ n/log n we
obtain

T(n) = T(n/log n) + C_2 log log n + C_3

for some constants C_2 and C_3. It is easily veri-
fied that a function of the form C_2 log n + o(log n)
is a solution of the above recurrence.


3. Parallel Sorting Algorithms with no 
Memory Fetch Conflicts 


We shall now consider a family of algorithms
for sorting n numbers in parallel with n^(1+α) pro-
cessors (0 < α ≤ 1) in time (C'/α) log n + o(log n),
for some constant C'. Each of these algorithms
has the same performance as the corresponding
algorithm by Hirschberg [5], although no memory
fetch conflict occurs in this case. Again, we
make the inductive hypothesis that for p < n,
Algorithm SORT2 requires p^(1+α) processors to sort
p keys. The format of SORT2 closely parallels 
that of SORT1, with a few crucial differences to 
be noted. 


Algorithm SORT2


begin
1. k ← ⌈n^α⌉, r ← ⌊n/⌈n^α⌉⌋

2. Define arrays S[0:k;0:k;0:2r-1],
   R[0:k;0:k;0:r-1]
   and A_i[0:r-1] ← A[ir:(i+1)r-1]
   (i=0,...,k-1), A_k[0:n-kr-1] ← A[kr:n-1].

3. A_i[0:r-1] ← SORT2(A_i[0:r-1]) (i=0,...,k-1)
   A_k[0:n-kr-1] ← SORT2(A_k[0:n-kr-1]).
   Comment: This parallel recursive call of
   SORT2 sorts k sets of r keys each and, pos-
   sibly, one set of n-kr < k keys. By the
   inductive hypothesis, at most
   k r^(1+α) + (n-kr)^(1+α) = N processors are used.
   Since n-kr < k, then
     N < kr r^α + (n-kr) k^α = kr(r^α - k^α) + n k^α.
   Also kr = ⌈n^α⌉ ⌊n/⌈n^α⌉⌋ ≤ n, whence
     N < n(r^α - k^α + k^α) = n r^α ≈ n^(1+α-α^2) < n^(1+α),
   where we have used the approximation r ≈ n^(1-α).
   Steps 1-3 are analogous to the correspond-
   ing ones in SORT1; however, the copying
   operation implemented by Step 4 of SORT1
   must be considerably modified, as shown by
   the following Steps 4-6, to avoid fetch
   conflicts. Here again, A_k is extended to
   size r as in SORT1.

4. S[i;k;0:r-1] ← A_i[0:r-1] (i=0,...,k-1)
   S[0;j;r:2r-1] ← A_j[0:r-1] (j=1,...,k)

5. for m ← 0 step 1 until ⌈log(k+1)⌉ - 2 do
     S[i;j-2^m;0:r-1] ← S[i;j;0:r-1]
       (j=k-2^m+1,...,k; i=0,...,j-2^m-1)
     S[i+2^m;j;r:2r-1] ← S[i;j;r:2r-1]
       (i=0,...,2^m-1; j=i+2^m+1,...,k)

6. Let ⌈log(k+1)⌉ - 1 = ν.
     S[i;j-2^ν;0:r-1] ← S[i;j;0:r-1]
       (j=2^ν+1,...,k; i=0,...,j-2^ν-1)
     S[i+2^ν;j;r:2r-1] ← S[i;j;r:2r-1]
       (i=0,...,k-2^ν-1; j=i+2^ν+1,...,k)
   Comment: Steps 4-6 jointly replicate each
   A_i[0:r-1] the required number k of times.
   Step 4 is an initial copy; Step 5 consists of
   (⌈log(k+1)⌉-1) stages, each of which doubles the
   ranges of the indices; Step 6 accounts for the
   fact that k may not be a power of 2 and com-
   pletes filling the array S. Clearly this
   copying operation is implemented in
   ⌈log(k+1)⌉+1 ≈ α log n + 1 time units. A straight-
   forward analysis shows that the largest
   number of processors used in any of these
   stages is at most 5/16 of the total number
   2r (k+1 choose 2) of cells of S to be filled. It is
   also easily shown that
   (5/16) 2r (k+1 choose 2) ≈ (5/16)(n^α+1)n < n^(1+α)
   for any n ≥ 1 and α > 0.
7. S[i;j;0:2r-1] ← MERGE(S[i;j;0:r-1],
   S[i;j;r:2r-1])
   (i=0,...,k-1; j=i+1,...,k).
   Comment: This step uses a stable version
   of Batcher's merging algorithm [1], which
   is easily obtained by requiring that when-
   ever two identical keys are encountered
   their subarray indices be compared. The
   following facts about Batcher's merging
   algorithm are well-known: (i) no fetch
   conflict occurs because at any stage (or,
   time unit) each key is compared with
   exactly one key; (ii)
   (k+1 choose 2) r ≈ [(n^α+1)n^α/2] n^(1-α) < n^(1+α)
   processors are used; (iii)
   merging is completed in log r ≈ (1-α) log n
   time units.

8. Steps 8, 9, 10, and 11 of this algorithm
   are respectively identical to Steps 6, 7,
   8, and 9 of SORT1 and are therefore
   omitted. The latter are clearly free of
   memory fetch conflicts. The analysis of
   SORT1 showed that at most
   max((k+1 choose 2) r, n⌊(k+1)/2⌋) processors
   were used in any of those steps. In the
   present case, we have already shown that
   (k+1 choose 2) r < n^(1+α); similarly we
   conclude n⌊(k+1)/2⌋ ≤ n(n^α+1)/2 < n^(1+α).
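
The idea behind the copying Steps 4-6 of SORT2 (replication by repeated doubling, so that no cell is read by more than one processor in the same time unit) can be shown on a simplified one-dimensional analogue. The sketch below does not reproduce the index pattern of array S; the function name and the slot layout are assumptions made for illustration.

```python
def replicate_without_fetch_conflicts(block, copies):
    """Fill `copies` slots with `block`; each stage reads a slot at most once."""
    slots = [None] * copies
    slots[0] = list(block)                 # initial copy
    filled, stages = 1, 0
    while filled < copies:
        step = min(filled, copies - filled)
        sources = list(range(step))        # distinct sources: no fetch conflict
        for src in sources:
            slots[filled + src] = list(slots[src])
        filled += step
        stages += 1
    return slots, stages


slots, stages = replicate_without_fetch_conflicts([7, 8, 9], copies=6)
```

For six copies the filled range grows 1, 2, 4, 6, so three stages suffice; the last stage is a partial one, playing the role that Step 6 plays when k is not a power of 2.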


From the performance viewpoint, all steps
of the algorithm require at most n^(1+α) processors,
as postulated. This extends the inductive
hypothesis on the number of processors used by
the algorithm. As to the running time T(n), we
note the following: Steps 4-6 jointly require
α log n + 1 time units; Step 7 requires (1-α) log n
time units; Step 10 requires α log n time units;
Steps 8, 9, and 11 run in constant time. Since
Step 3 is a recursive call of SORT2 on sets of
r ≈ n^(1-α) elements, we obtain for T(n) the
recurrence equation

T(n) = T(n^(1-α)) + (C_1 α + C_2) log n + C_3

for some constants C_1, C_2, and C_3. It is easily
verified that a function of the form
[(C_1 α + C_2)/α] log n + o(log n) is a solution of
this equation, whence T(n) ≤ (C'/α) log n + o(log n).
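
The 1/α factor can be checked numerically by unrolling the recurrence. The sketch below keeps only the dominant term c log n at each level (the additive constant C_3 is dropped, and c stands for C_1 α + C_2); the function name and the stopping threshold n ≤ 2 are assumptions. Since the terms form a geometric series with ratio (1-α), their sum stays below (c/α) log n.

```python
import math


def unrolled_time(n, alpha, c=1.0):
    """Sum c*log2(n) over the chain n, n^(1-alpha), n^((1-alpha)^2), ..."""
    total = 0.0
    while n > 2:
        total += c * math.log2(n)
        n = n ** (1 - alpha)
    return total


n, alpha = 2 ** 20, 0.5
total = unrolled_time(n, alpha)
bound = math.log2(n) / alpha
```

For n = 2^20 and α = 1/2 the successive terms are 20, 10, 5, 2.5 and 1.25, which approach but do not exceed the bound of 40.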
Abstract


We consider the problem of triangulating a 
sparse matrix in a number of steps such that in 
each step all of the arithmetic operations that 
can be done in parallel are so executed. Our 
object is to minimize the number of such steps 
and at the same time to minimize the number of 
such operations. These two requirements are not 
compatible and both depend on the ordering of 
the matrix. A reordering algorithm which is a 
compromise is proposed. For a given ordering, 
an algorithm to sequence the operations in order 
to complete the triangulation in minimal number 
of steps is presented and bounds on the number 
of processors required are given. Experimental 
results on matrices of order 500 are reported. 


Background 


The triangulation of an n x n sparse matrix
A = [a_ij] consists of a series of steps each of
which requires one of the following two sets of
arithmetic operations:


For k=1, 2, ..., n-1 and for each
a_kj ≠ 0

     a_kj ← a_kj / a_kk,           j > k       (1)

and for each pair a_ik, a_kj ≠ 0

     a_ij ← a_ij - a_ik * a_kj,    i, j > k    (2)


A total of 2n - 2 sequential steps are re-
quired to triangulate A. We shall call (1)
the "divide" operation and (2) the "update"
operation. In (2) if a_ij = 0 but
a_ik a_kj ≠ 0, a fill-in is generated. It is
obvious that if we have a sufficient number of
processors, the divide operations for each row
k can be done in parallel. Also, for each k,
the update operations for all pairs
a_ik, a_kj ≠ 0 can be done in parallel. More-
over, if A is very sparse, it is possible that
the divide and update operations of several rows
be done at the same time. The total number of
steps to triangulate A might therefore be less
than 2n - 2.
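
Operations (1) and (2) and the notion of a fill-in can be illustrated with a short sequential sketch on a dense list-of-lists representation. This is an illustration under assumptions (the function name, the 0-based indexing, and the 3 x 3 test matrix are chosen here); it performs the operations one at a time and says nothing about how they would be distributed over processors.

```python
def triangulate(a):
    """Apply divide (1) and update (2) in place; return the number of fill-ins."""
    n = len(a)
    fill_ins = 0
    for k in range(n - 1):
        for j in range(k + 1, n):              # divide: a_kj <- a_kj / a_kk
            if a[k][j] != 0:
                a[k][j] /= a[k][k]
        for i in range(k + 1, n):
            for j in range(k + 1, n):          # update: a_ij <- a_ij - a_ik*a_kj
                if a[i][k] != 0 and a[k][j] != 0:
                    if a[i][j] == 0:
                        fill_ins += 1          # a zero entry becomes nonzero
                    a[i][j] -= a[i][k] * a[k][j]
    return fill_ins


m = [[1, 2, 0],
     [0, 1, 0],
     [3, 0, 1]]
count = triangulate(m)
```

In this small example the only fill-in is the entry in row 3, column 2, created when row 1 is the pivot row.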


*Research supported by NSF ENG-76-02870. 
For example, in the following matrix


     [4 x 4 sparse matrix pattern]          (3)


the divide operations of rows 1 and 2 can be done
simultaneously in step 1. In step 2, three update
operations, including that of a_33, can be done
at the same time. In step 3, a_34 is divided
by a_33 and one further element is updated.
Finally in step 4, a_44 is updated. Thus the
matrix is triangulated in four steps.


In this paper, we propose an algorithm to 
compute the minimum number of steps for tri- 
angulating a sparse matrix once the ordering of 
A is given. We next give upper and lower 
bounds on the number of processors required. 
Lastly, we present an algorithm to find an 
optimal ordering of A to meet the two require- 
ments of minimization of the number of arithmetic 
operations and the number of triangulation steps. 


Triangulation Graph 


In order to expose the parallelism among 
the divide and update operations, we use a unit- 
execution-time model [1] to represent the tri- 
angulation process. This model is defined as a 
directed, acyclic graph, G(V,E), with node 
set V and arc set E defined as follows. 


V = {v_i | v_i represents either a divide or
     update operation,
     i = 1, 2, ..., m}


The total number of operations is m and we
assume that it takes a processor one time unit
to execute an operation, and no preemption is
allowed.


E = {(v_i, v_j) | v_i, v_j ∈ V, i ≠ j, and
     the operation represented by
     v_j needs the results of the
     operation represented by
     v_i}


In the graph, an arc goes from v_i to v_j.


Implicit in the definition of E is a set 
of precedence relations which are specified by 
a sequential description of the divide/update 
operations. Corresponding to each sequential 
description is a graph G(V,E), which we shall 
henceforth call a triangulation graph. 


As an example, consider the following matrix 


     [7 x 7 sparse matrix pattern, rows and
      columns numbered 1 to 7]               (4)

One possible sequence of operations to tri-
angulate the matrix is listed in Table 1. Its
triangulation graph is shown in Fig. 1.


The parallelism that exists among the oper-
ations is now clear from the graph. Operations
1, 3, 8 and 7 can be done in parallel in one
step. Operations 2, 12, 4 and 11 can be done
in parallel in the next step, and so forth. It
is also clear from the graph that the minimum
number of time steps to triangulate the matrix
is 7.


We shall define the length of a path of 
G(V,E) as the number of nodes from the starting 
node to the end node and we let D denote the 
length of the longest path of G. D then is 
the minimum number of steps to complete the tri- 
angulation according to a sequence of operations 
implied by E of G(V,E). For the example 
shown, D= 7. The triangulation in the order 
given results in two fill-ins. 


Now suppose that we reorder the matrix as
follows.


     [the 7 x 7 pattern of (4) with its rows
      listed in the order 1, 5, 7, 4, 6, 3, 2]   (5)


No fill-ins are created but now D=12 if we
sequence the operations row by row from top to
bottom and column by column from left to right.
Thus the minimum number of time steps and the 
number of fill-ins and hence arithmetic opera- 
tions depend on the ordering of the matrix. 


Rearrangement of Nodes 


In Fig. 1, both v_6 and v_9 are update
operations on a_66. From the expressions:

     v_6:  a_66 ← a_66 - a_63 a_36       (6)
     v_9:  a_66 ← a_66 - a_64 a_46       (7)

we see that regardless of the order in which these
two operations are executed, the final a_66 used in
v_15 will be the same. If we remove v_6 from the
longest path and place it between v_9 and v_15,
we obtain the same final triangulation but D
will be reduced by 1. In the sequential descrip-
tion of the operations, operation 6 of Table 1
is now placed after operation 9. The new graph
is shown in Fig. 2. Thus by postponing opera-
tion 6 we succeed in reducing the triangulation
steps by one.


In general, before a matrix element is sub-
ject to a divide operation, it may have to be
updated several times, i.e.

     a_ij ← a_ij - sum over k=1,...,Q of a_ik a_kj     (8)

Numerically, it does not matter which product in
the summation is subtracted from a_ij first.


Our object is to find a sequence which minimizes 
D. In the next three sections, we describe how 
this is done. 


Depth of a Node 


In G(V,E) we shall call a node without
any predecessor an initial node. Nodes v_1, v_3,
v_7 and v_8 in Fig. 2 are initial nodes. The
operations they represent are available for exe-
cution in the first time step.


We define the depth d_i of node v_i as the
maximum path length from an initial node to v_i
and it signifies the earliest time at which the
operation of v_i can be executed. The depths
of the nodes of Fig. 2 are shown in Table 2.
Clearly the depth of the termination node of G
is D.
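
As a hedged illustration of the definition, the depth of every node of a small precedence graph can be computed by a memoized longest-path recursion, counting nodes rather than arcs so that an initial node has depth 1. The graph below is a made-up example, not the graph of Fig. 1 or Fig. 2, and the function name is an assumption.

```python
from functools import lru_cache


def node_depths(nodes, arcs):
    """Depth = number of nodes on the longest path from an initial node."""
    preds = {v: [] for v in nodes}
    for u, v in arcs:                      # arc (u, v): v needs the result of u
        preds[v].append(u)

    @lru_cache(maxsize=None)
    def depth(v):
        return 1 + max((depth(u) for u in preds[v]), default=0)

    return {v: depth(v) for v in nodes}


# Hypothetical graph: nodes 1 and 3 are initial, node 5 is the termination node.
depths = node_depths([1, 2, 3, 4, 5], [(1, 2), (2, 4), (3, 4), (4, 5)])
D = max(depths.values())
```

Nodes with equal depth are candidates for execution in the same time step, and D is the depth of the termination node.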


Operation Set and Depth Set 


We have seen that a matrix element a_ij is
in general subject to a series of update opera-
tions. It is convenient to associate with each
a_ij an operation set Ω_ij and a depth set Δ_ij
defined as follows.


Ω_ij:  v_r ∈ Ω_ij if and only if opera-
       tion v_r applies to a_ij, and we write
       Ω_ij = {..., v_r, ..., v_s, ...} if oper-
       ation v_r precedes operation v_s.

Δ_ij:  d_r ∈ Δ_ij if d_r is the depth
       of v_r ∈ Ω_ij.


For example, from Table 1 and Fig. 1 we have 


2 6 {vy} Ac = {1} 

22 {vy} Azz = {2} 

Ω_66 = {v_6, v_9}              Δ_66 = {4,5}
Ω_77 = {v_12, v_14, v_16}      Δ_77 = {2,4,7}


On the other hand, in Fig. 2, we have 


N66 = { Vg»V} hee a Gres 


Ω_77 = {v_12, v_14, v_16}      Δ_77 = {2,4,6}


Thus, by altering the sequence of operations of
Ω_66 the depth associated with each operation


may be changed and in this case the final D is 
reduced from 7 to 6. 


Given an ordering of the rows, the order in 
which the matrix elements are processed is 
fixed. However, within the set of operations 
applied to each matrix element it is possible to 
vary the sequence of update operations to ob- 
tain a different triangulation graph, and hence 
a different number of steps necessary to tri- 
angulate the matrix. 


Minimal D 


In the following, we give an algorithm by
which for a given ordering of A the update
operations in each Ω_ij are sequenced such that
D is minimized. The sets Ω_ij and Δ_ij are
constructed at the same time as the ordering of
A is generated. We denote the ordering of the
rows by p(1), p(2), ..., p(n).
Algorithm 1
(i)   Input:  Matrix A. We assume all
              diagonal elements and
              pivots are nonzero.
(ii)  Output: An ordered sequence of
              operations in each Ω_ij,
              the depth sets and D.
(iii) Initialization: Ω_ij = ∅ for all i,j
                      Δ_ij = {0} for all i,j
                      r = 0
(iv)  Procedure I
      begin
        for k ← 1 step 1 until n-1 do
        begin
          call Procedure II (described in Al-
            gorithm 2) to determine the k-th
            pivoting row; let it be row q;
          p(k) ← q;
          for each j such that a_qj ≠ 0 and
            j ∉ {p(1),p(2),...,p(k)}
          do
          begin
            r ← r+1;
            v_r ← a label assigned to the di-
                  vide operation on a_qj;
            d_r ← Max{Δ_qq ∪ Δ_qj} + 1;      (A)
            Δ_qj ← Δ_qj ∪ {d_r};
            Ω_qj ← Ω_qj ∪ {v_r}
          end
          for each i such that a_iq ≠ 0 and
            each j such that a_qj ≠ 0 and
            i,j ∉ {p(1),p(2),...,p(k)}, do
          begin
            r ← r+1;
            v_r ← a label assigned to an update
                  operation on a_ij;
            d_r ← Max{Δ_iq ∪ Δ_qj} + 1;      (B)
            while ∃ d' ∈ Δ_ij, d' = d_r do
            begin
              d_r ← d_r + 1                  (C)
            end
            call Insert(d_r, Δ_ij) (Insert d_r
              into Δ_ij so that d ∈ Δ_ij
              are in ascending order.);
            call Insert(v_r, Ω_ij) (Insert v_r
              into Ω_ij so that v_r ∈ Ω_ij
              are in the same order as
              d ∈ Δ_ij.)
          end
        end
        p(n) ← q such that q ∉ {p(1),p(2),
               ...,p(n-1)};
        D ← Max{Δ_p(n)p(n)}
      end


(v) Comments:


Statement (A) signifies that the divide oper-
ation on an element a_qj must take place at
least one time step later than the latest opera-
tion on the pivot or itself.


Statement (B) says that an update operation
on a_ij should take place at least one step
later than the latest operation on a_iq or a_qj
irrespective of when the previous operation on a_ij
took place. In this way we are guaranteed that
the last operation on a_ij is completed at the
earliest time step. Now, if in Δ_ij there is
already an element with value d_r, then d_r is
increased by one until it is different from every
d ∈ Δ_ij. This step is necessary because no two
operations can be applied to the same matrix ele-
ment at the same time. The while-loop of (C) ac-
complishes this.


A formal proof that D obtained from Algorithm 1
is minimal is long and would require
the introduction of additional new concepts. It 
will not be given in this paper. A plausible 
argument that the algorithm does produce a mini- 
mal D is that each of the three key steps of 
the algorithm ensures that every operation on a 
matrix element is assigned the smallest possible 
depth. 
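
The depth assignment of statements (A), (B) and (C) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `triangulation_depth`, the dense boolean pattern used to represent the sparse matrix, and the pivot order passed in directly (instead of being produced by Procedure II) are all assumptions made for the example, and the operation labels are omitted.

```python
import bisect

def triangulation_depth(nonzero, order):
    """Return D for a given nonzero pattern and pivot order (hypothetical helper).

    nonzero[i][j] is truthy when a_ij is structurally nonzero.
    order lists the pivoting rows p(1), ..., p(n) as 0-based indices.
    """
    n = len(nonzero)
    nz = [[bool(v) for v in row] for row in nonzero]
    # One depth set per matrix element, initialised to {0}, kept sorted.
    depth = [[[0] for _ in range(n)] for _ in range(n)]
    pivoted = set()
    for q in order[:-1]:
        pivoted.add(q)
        cols = [j for j in range(n) if nz[q][j] and j not in pivoted]
        rows = [i for i in range(n) if nz[i][q] and i not in pivoted]
        for j in cols:
            # (A): divide a_qj one step after the latest operation
            # on the pivot a_qq or on a_qj itself.
            d = max(depth[q][q] + depth[q][j]) + 1
            depth[q][j].append(d)  # d exceeds every entry, order is kept
        for i in rows:
            for j in cols:
                # (B): update a_ij one step after the latest operation
                # on a_iq or a_qj.
                d = max(depth[i][q] + depth[q][j]) + 1
                # (C): no two operations on a_ij in the same time step.
                while d in depth[i][j]:
                    d += 1
                bisect.insort(depth[i][j], d)
                nz[i][j] = True  # a fill-in becomes a nonzero element
    last = order[-1]
    return max(depth[last][last])

dense = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
arrow = [[1, 0, 1], [0, 1, 1], [1, 1, 1]]
print(triangulation_depth(dense, [0, 1, 2]))
print(triangulation_depth(arrow, [0, 1, 2]))
```

For the dense 3 x 3 pattern the sketch gives the sequential limit 2n-2 = 4, while for the sparser "arrow" pattern the two pivots are independent and the while-loop of (C) resolves the collision on the last diagonal element, giving 3.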


For the matrix of (4), if p(k) =k as 
shown, D is found to be 6 by Algorithm 1, and 
the triangulation graph is that of Fig. 2. From 
the graph, it is not difficult to determine that 
at least three parallel processors are required to 
triangulate the matrix in 6 steps. An optimal 
schedule using three processors is shown in 
Fig. 3. In this schedule, not all operations having
the same depth are executed at the same time.
In general, the scheduling problem is NP-com- 
plete [2]. In the following, we consider the 
bounds on the minimum number of processors. 


Bounds on the Number of Processors 


We define the level number, $u_i$, of a node $v_i$ of G(V,E) as:

$$u_i = D + 1 - \max\{\, p_i \mid p_i \text{ is a path length from } v_i \text{ to the termination node} \,\}$$


The level numbers of the nodes of Fig. 2 are
shown in Table 2. Each level number indicates
the last time step by which the operation
represented by $v_i$ must be executed, if the
triangulation is to be completed in $D$ steps.
Let the number of nodes in G(V,E) which
have the same level number, say $i$, be $\ell_i$.
Then an upper bound on the number of processors
to triangulate the matrix in $D$ steps is clearly

$$B_u = \max\{\, \ell_i \mid i \in \{1, 2, \ldots, D\} \,\} \qquad (9)$$


A lower bound can be defined as

$$B_\ell = \max_{i = 1, 2, \ldots, D} \; \frac{1}{i} \sum_{j=1}^{i} \ell_j \qquad (10)$$

The meaning of $B_\ell$ as defined is as follows.
From the previous comments on level numbers, for
each step $i$, $i = 1, 2, \ldots, D$, all $\sum_{j=1}^{i} \ell_j$
operations must be completed in $i$ steps and
at least $(\sum_{j=1}^{i} \ell_j)/i$ processors are needed. The
maximum of this last quantity is then a lower
bound on the number of processors required.
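
The two bounds are simple to evaluate once the number of nodes at each level is known. The sketch below is a minimal illustration under stated assumptions: the function name `processor_bounds` and the sample level counts are hypothetical (they are not the counts of Table 2), and the lower bound is rounded up to a whole number of processors, which is an addition made here for convenience.

```python
import math

def processor_bounds(level_counts):
    """level_counts[i-1] is the number of nodes with level number i, i = 1..D."""
    upper = max(level_counts)              # bound (9)
    lower = 0
    running = 0
    for i, count in enumerate(level_counts, start=1):
        running += count                   # operations due by step i
        lower = max(lower, math.ceil(running / i))  # bound (10), rounded up
    return lower, upper

# Hypothetical level counts for a graph with D = 6.
print(processor_bounds([3, 2, 4, 1, 1, 1]))
```

With these sample counts the running averages are 3, 2.5, 3, 2.5, 2.2 and 2, so the lower bound is 3 and the upper bound is 4.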


Applying (9) and (10) to the graph of Fig. 2,
we get $B_u$ = [illegible] and $B_\ell = 3$. An optimal
schedule using three processors is shown in Fig. 3.

Reordering Algorithm 


We noted earlier that the number of tri- 
angulation steps D depends on the ordering of 
the rows of the matrix. For a given ordering, 
the total number of operations is fixed. It ap- 
pears therefore that the smaller D is, the more 
operations can be done in parallel. However, 
the strategy to minimize D alone would leave 
the generation of fill-ins and hence the number 
of arithmetic operations uncontrolled, and it is 
possible that the number of parallel processors 
required is unreasonable. Also, as the matrix
begins to fill, fewer and fewer arithmetic operations
of different pivoting rows can be done in
parallel and D would increase.


Clearly the best that one can do realistically
is to do as much local minimization as possible [3].
It is not easy to discern the relationship
between D and reordering, and we
propose the following scheme as a reordering
algorithm.


Algorithm 2

(i) Input: Matrix $A$; the order of $A$, $n$; the first $k-1$
    pivoting row numbers, $p(1), p(2), \ldots, p(k-1)$; and the
    depth sets of $a_{ij}$ that have been constructed up to the
    time this procedure is called.
(ii) Output: The next pivoting row number, $q$.
(iii) Initialization: Relative weight assigned to minimization of
    arithmetic operations, $W_o$; and relative weight assigned to
    minimization of depth, $W_D$.
(iv) Procedure II
begin
  for each $j$ such that $j \notin \{p(1), p(2), \ldots, p(k-1)\}$ do
  begin
    $O(j) \leftarrow$ number of update/divide operations if row $j$
      is picked;
    for each $m \notin \{p(1), p(2), \ldots, p(k-1), j\}$ do
    begin
      $L(j) \leftarrow (\max_m\{\Delta_{jm}\} + \max_m\{\Delta_{mj}\}) + \max\{\Delta_{jj}\}$   (A)
      or
      $L(j) \leftarrow \max_m\{\Delta_{jm} \cup \Delta_{mj} \cup \Delta_{jj}\}$;   (B)
    end
    $C(j) \leftarrow W_o \cdot O(j) + W_D \cdot L(j)$
  end
  find $j^*$ such that $C(j^*) = \min_j \{C(j)\}$;
  $q \leftarrow j^*$
end

Note that in statements (A) and (B), the depth sets are the
current ones that have been constructed up to the time the
procedure is called, but this is the best that one can do since
we are only interested in local minimization. The quantity
$L(j)$ has the significance that intuitively, among the pivot
candidates which call for about the same number of update/divide
operations, the one which requires the least number of time steps
to complete all of the update operations, up to the time the
procedure is called, of the elements on the pivoting row and
column should be chosen as the next pivot.

By adjusting the weights $W_o$ and $W_D$, different degrees of
importance are assigned to operation count and depth. If
$W_D = 0$, then the reordering algorithm becomes Markowitz
algorithm [4]. If $W_o = 0$, reordering is based on consideration
of reducing D alone.

The proposed reordering algorithm was applied to a number of
sparse matrices of order ranging from 100 to 500. The results of
four cases are given in Tables 3 to 6. The matrix of Table 3 has
a bandwidth much larger than that of the matrix of Table 4.
Tables 3 and 5 refer to the same matrix except that $L(j)$ of
option (A) is used in Table 3 and option (B) is used in Table 5.
Similarly for Tables 4 and 6. The following remarks can be made.

(1) The depth D in all cases is a small fraction of 2n-2, in
fact, much smaller than n. This is to say that maximum
utilization of parallelism would significantly reduce the number
of sequential steps necessary to triangulate A. For example, in
Table 3, column 3, we see that 26895 steps are necessary to
triangulate A if only one processor is used. The matrix can be
triangulated in 91 steps, but more than 378 processors would be
required.

(2) As $W_D/W_o$ increases, thus giving more relative importance
to minimization of D, more fill-ins are generated. D does
decrease for the case with small bandwidth. If the bandwidth is
large, the reduction in D is counterbalanced by increase of
fill-ins.

(3) While there is no rational basis to determine what the best
ratio of $W_D$ to $W_o$ should be, our experiments indicate that
going to the extreme of $W_D/W_o$ = 104 does not result in
significant decrease of D. It seems that a good reordering
algorithm is one which uses Markowitz's scheme to find the pivots
and uses $L(j)$ to break a tie, if any.

(4) The results of the experiments are not sensitive to whether
option (A) or (B) is used.

Conclusion

We have shown that a high degree of parallelism exists among the
operations in the triangulation process of a sparse matrix.
Recognizing this, we take advantage of as much parallelism as
possible in every step of the process. In contrast with array
processing of matrices [5], our approach assigns operations on
matrix elements on different rows to parallel processors. In this
way we are able to reduce the total number of steps to
triangulate the matrix, and as we have seen, the reduction is
dramatic and the number can be significantly smaller than the
order of the matrix.

The important results of this paper are: (1) an algorithm to
sequence the operations of a triangulation process so as to
minimize the number of time steps required; (2) a lower and upper
bound on the number of parallel processors required in order to
triangulate a matrix in minimal number of time steps; and (3) an
algorithm to reorder the matrix to obtain local minimization of
the number of operations and the number of triangulation steps.
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[1] 


[2] 


[3] 


[4] 


[5] 
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Te. ap 294 
(Agy + Age / Are 
SET 897 9s 
467 * 467 / 866 
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Table 1 


label operation 


d Uu Order of matrix: 500 
_)_ poe aes as = Number of nonzero elements: 2060 
1 a 1 7 Density: 0.82% 
15 
2 ace 2 3 
3 . l 1 Operation weight| 100.0} 100. eS 
23 
2 2-36 3 Fill-ins 2675 | 2630 | 2866 | 3594 
6 Be 6 4 4 
7 A l a Total nonzeros 4735 4690 4926 5654 
46 
8 ay7 1 3 Total operations |25943 {24811 {26895 {36308 
. 467 . B, 1143 |1089 | 1138 | 1395 
11 a Z 5 
76 B 
12 a9 2 4 u 
13 aca 3 4 Be 341 329 
14 aa A 4 ° Average: 
15 aes 5 5 Total op./D 273.1 | 263.9 
16 aoo 6 6 
Table 3 
Table 2 
Order of matrix: 500 
Number of nonzero elements: 1952 
Density: 0.78% 


Operation weight 


| 
Teo[ oa] x0 foo 
< Total nonzeros 2885 3471 

Total operations {4519 


= Papel Fal 
sLefs Paie|— 
i [Fafa | 


Depth weight 


Fill-ins 


Depth 69 
An optimal schedule of operations B 
of matrix A using 3 processors. m 
Figure 3 Ba 
‘ ibs 
Average: 
Total op./D 104.4 | 147.8 


Table 4 
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Order of matrix: | . 500 
Number of nonzero elements: — 2060 
Density: 0.82% 


Operation weight 
Depth weight | 
Fill-ins 2653 2638 © 


Total nonzeros 4713 4698 5354 
Total operations| 25943 25287 24835 31019 
: ac 1089 | 1177 


By ERE 


Average: 
Total op./D 73.1 | 274.91 264.2 
Table 5 
Order of matrix 500 
Number of nonzero elements: 1952 
Density: : 0.78% 


Operation weight 


Depth weight 


100.0 | 100.0 ae 
EOE 
aa 


2885 2924 2946 3326 
4519 4729 4788 6437 


Fill-ins 


Total nonzeros 


Total operations 


Depth 


Average: 
Total op./D 


Table 6 
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VECTOR REDUCTION OPERATIONS ON CRAY-1 
AND THEIR PERFORMANCE 


T. L. Jordan 
C-Division 
Los Alamos Scientific Laboratory 
Los Alamos, NM 87545 


Summary


The CRAY-1 computer architecture has a con- 
ventional scalar vocabulary plus a limited yet 
powerful vector vocabulary. Although the scalar 
performance of CRAY is about twice that of the CDC- 
7600 or IBM 360-195, vector performance is typical- 
ly at the 40 to 70 megaflop rate with some impor- 
tant applications such as matrix multiplication and 
matrix inversion performing at demonstrated rates
of 100 to 140 megaflops. 


The aforementioned vector rates are applicable 
to functions or operations which produce vector 
valued results from vector operands or mixed scalar 
and vector operands. When the results are scalar 
and functions of vector operands, i.e., reduction 
operations, then the hardware is not so explicitly 
equipped to provide the kind of speeds shown above. 
For the more important functions of the latter 
type, summing a vector, computing a dot product and 
searching a vector for its maximum or minimum, one 
asks to what extent performance can be improved 
over optimal scalar code through the use of the 
vector hardware. For some reduction operations,
CRAY-1 has a peculiar instruction that explicitly
assists the implementation of the first two functions.
No similar instruction is available for finding the
maximum (minimum) value of a vector and its index.
Hence, in addition to describing the implementations
we have used for the first two functions, we
present in some detail a vector search algorithm.
The performance as a function of vector length is
also given.


The logic for handling the vector sum problem 
and the dot product are intrinsically the same. 
One divides up the vectors involved into m seg- 
ments including the residue segment, i.e., n = 64 
(m - 1) + RESIDUE, where n is the total vector 
length. Operating on these segments as vectors 
one obtains 64 or fewer partial sums. These partial
sums are then collapsed to 8 partial sums
through the use of the aforementioned special 
recursive hardware instruction. Finally, one then 
forms a scalar sum from 8 elements to produce the 
desired sum. The asymptotic time to do this is 
1 cycle per element for the sum and 2 cycles per 
element for the dot product. However, for smaller 
vector lengths this degrades to scalar rates. 
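
The three stages described above (segment-wise accumulation, collapse to 8 partial sums, final scalar sum) can be mimicked in plain Python. This is a sketch of the data flow only, not CRAY-1 code: the function name `segmented_sum` is hypothetical, the strided grouping used for the 64-to-8 collapse is an assumption standing in for the special hardware instruction, and no timing behaviour is modelled.

```python
def segmented_sum(values, segment=64, lanes=8):
    """Sum a vector in the three stages described in the summary."""
    # Stage 1: treat each 64-element segment (the last may be a shorter
    # residue) as a vector and add it elementwise into 64 partial sums.
    partial = [0] * segment
    for start in range(0, len(values), segment):
        for i, v in enumerate(values[start:start + segment]):
            partial[i] += v
    # Stage 2: collapse the 64 partial sums to 8 (assumed strided grouping).
    collapsed = [sum(partial[i::lanes]) for i in range(lanes)]
    # Stage 3: scalar sum of the 8 remaining elements.
    total = 0
    for p in collapsed:
        total += p
    return total

print(segmented_sum(list(range(1000))))  # same value as sum(range(1000))
```

Only stage 1 grows with the vector length, which is consistent with the remark that the asymptotic rate is reached for long vectors while short vectors are dominated by the fixed cost of stages 2 and 3.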


Despite the absence of explicit hardware 
aids, excellent performance for large vectors can 
be obtained through software for the problem of 
finding the index of the maximum (minimum) value 
of a vector. The asymptotic rate at which each 
comparison can be made is between 2 and 3 machine 
cycles per compare. A detailed description of 
the software technique used to achieve this speed 
is presented. 


IMAGE MAGNIFICATION 


J. M. Vocar 
Digital Technology 
Goodyear Aerospace Corporation 
Akron, Ohio 44315 


Summary 


The processing necessary to convert raw 
input data into a convenient form which can be 
analyzed by man or machine generally requires 
many calculations for each input pixel. With 
every pass of an observation satellite, thousands
upon thousands of such input pixels are being
received. The demand to convert this data into a 
usable form is stressing the capacity of existing 
digital (sequential) processors. 


This ever increasing amount of input data is 
forcing the image processing community to examine 
hardware in which many parameters are treated 
simultaneously; i.e., parallel processors, to 
handle the processing load. While the maximum 
processing capability of these processors is 
usually impressive, the capability can be realized 
only if the data can be delivered to the process- 
ing elements efficiently. 


To determine the suitability of a STARAN (a)
parallel processor in the area of digital image 
processing, a cubic convolution interpolation 
algorithm for image magnification was implemented 
on a 2-way STARAN-B series machine. The program 
magnified an arbitrarily sized, arbitrarily- 
located rectangular subset of 8-bit pixels 
imbedded within a 512 x 512 input image into a 
512 x 512 output image. Independently specified, 
noninteger X and Y magnifications were performed 


using two-dimensional cubic convolution resampling 


procedures. 


The developed magnification program has 
practical value. Magnification is an essential 
requirement in the areas of scene examination and 
fingerprint analysis. The cubic convolution
reconstruction filter can also be used as a pre- 
lude to further image enhancement techniques. 


When an image is magnified, the spaces 
in between the given pixels must be filled in. 
These intermediate pixel values are generally 
supplied using one of the three following most 
popular methods: 1) pixel replication, 2) bi- 
linear interpolation, or 3) cubic convolution. 


This paper presents a brief discussion of the 
advantages and shortcomings of each of these three 
approaches, including aliasing and roll-off 
effects. The cubic convolution approach is shown 
to be clearly superior to the other two methods. 
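
For readers unfamiliar with cubic convolution, the following one-dimensional Python sketch shows the idea: each output sample is a weighted sum of the four nearest input samples. It is not the STARAN program described here. The kernel coefficient `a = -0.5`, the edge clamping, and the function names `cubic_kernel`, `resample_1d` and `magnify_1d` are assumptions chosen for the example; a two-dimensional magnification would apply the same step along rows and then along columns.

```python
import math

def cubic_kernel(s, a=-0.5):
    """Piecewise-cubic interpolation kernel with free parameter a (assumed -0.5)."""
    s = abs(s)
    if s <= 1.0:
        return (a + 2.0) * s**3 - (a + 3.0) * s**2 + 1.0
    if s < 2.0:
        return a * s**3 - 5.0 * a * s**2 + 8.0 * a * s - 4.0 * a
    return 0.0

def resample_1d(samples, x, a=-0.5):
    """Interpolate the sample sequence at the non-integer position x."""
    base = math.floor(x)
    total = 0.0
    for k in range(base - 1, base + 3):           # four nearest samples
        idx = min(max(k, 0), len(samples) - 1)    # clamp at the borders
        total += samples[idx] * cubic_kernel(x - k, a)
    return total

def magnify_1d(samples, factor, a=-0.5):
    """Magnify a row of pixels by a non-integer factor."""
    n_out = int(round(len(samples) * factor))
    return [resample_1d(samples, i / factor, a) for i in range(n_out)]

row = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
print(resample_1d(row, 2.5))
print(len(magnify_1d(row, 1.5)))
```

At integer positions the kernel weights reduce to 1 for the sample itself and 0 for its neighbours, so the original pixels are reproduced exactly; pixel replication and bilinear interpolation correspond to narrower kernels with one and two taps respectively.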


(a) 


T.M. Goodyear Aerospace Corporation, 
Akron, Ohio 44315 


The use of STARAN to perform cubic convolution, 
including algorithm implementation and the 
program performance is described. 


It was found that the magnification process 
could be performed in less than 0.4 seconds. 
Approximately 30 add or multiply operations were 
required for each output pixel, and about 7.5 
million of such operations were performed during 
the magnification process. 


The results of the work described show 
STARAN to be a highly efficient and effective 
tool in processing image data, e.g., in the areas 
of resampling and reconstruction. The processing
power of STARAN, with its flexible routing and
high bandwidth, makes it particularly adaptable
to digital image processing. References [1]-[7]
detail the STARAN organization and several
processing techniques.
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ALGORITHM DEVELOPMENT FOR PIPELINED PROCESSORS 


P.M. Kogge 
IBM Federal Systems Division 


Owego, NY


Summary


Although processors using pipelined tech- 
niques have been available for over a decade (for 
example, the IBM System/360, Model 91), only re- 
cently have they become sufficiently general, with 
flexible enough controls over the pipeline itself, 
to allow development for a single pipeline of a 
whole range of complex algorithms, such as the 
Fast Fourier Transform, optimal filter derivation, 
interpolations, etc. Examples of such processors
include the IBM 3838, the Proteus Arithmetic
Element, the IBM 2938, and others. However, the
development of efficient algorithms for such
processors differs greatly from that for conventional
processors. This paper discusses some of
the tradeoffs involved in the development of such
algorithms.


The typical system hierarchy of such proces- 
sors includes a pipelined arithmetic unit contain- 
ing adders, multipliers, etc. Operands for this 
unit come from a small high-speed "local store". 
Since there is never enough memory in the local 
store to hold the largest problems, there is 
typically a large "main store" with some kind of 
"storage-to-storage transfer" unit between it and 
"local store". Pipelining exists in at least 
three different levels: within the arithmetic 
modules themselves, within the interconnection 
of arithmetic modules and the "local store", and 
with the overlap of arithmetic operations and 
data transfers between memories. Efficient al- 
gorithms must consider all three levels. 


Clearly the major constraint on the total 
speed of an algorithm is the rate at which the 
innermost loop of calculations can be performed. 
For a pipelined processor, however, we must care- 
fully choose how many calculations to include in 
the inner loop to maximize performance. In most 
cases it is necessary to "balance" the number of 
operands accessed from the "local store" with the 
number of each type of operation occurring in one
iteration of the inner loop in the arithmetic 
processor unit. As an example, the inner loop 
of an FFT is often considered to be what is known 
as a "2-point Butterfly," which is simply re- 
peated over and over with different data. On 
many pipelines, however, this calculation uses up 
all the "local store" bandwidth while leaving the 
arithmetic unit only 50% utilized. However, by 
expanding the inner loop to include four butter- 
flies in a certain pattern both the arithmetic 
modules and the "local store" can be 100% util- 
ized, with twice the performance of the simpler 
inner loop. 


Another related problem occurs when the inner
loop involves a recurrence, that is, when a
series $x_1, \ldots, x_n$ is to be computed, and $x_i$
depends on $x_{i-1}, \ldots, x_{i-m}$. Direct implementations
of this on a pipeline are inefficient since the
calculation of $x_{i+1}$ cannot start until $x_i$ is
complete. There are techniques, however, that allow
many different recurrences to be "backed up" or
"overlapped" to the point where a pipeline can
more quickly compute them (cf. Kogge [1,2]).
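
One simple way to see what "backing up" a recurrence means is to substitute the recurrence into itself once. The Python sketch below does this for the first-order linear recurrence x_i = a*x_{i-1} + b_i, which is an example chosen here and not taken from the summary; the function names are hypothetical. After the substitution, x_i depends on x_{i-2}, so the even-indexed and odd-indexed terms form two independent chains that a pipeline could interleave.

```python
def direct(a, b, x0):
    """x_i = a * x_{i-1} + b_i, each term waits for the previous one."""
    xs, x = [], x0
    for bi in b:
        x = a * x + bi
        xs.append(x)
    return xs

def backed_up(a, b, x0):
    """Same series via x_i = a^2 * x_{i-2} + (a * b_{i-1} + b_i)."""
    n = len(b)
    xs = [0] * n
    if n > 0:
        xs[0] = a * x0 + b[0]
    if n > 1:
        xs[1] = a * xs[0] + b[1]
    a2 = a * a
    # These combined terms do not depend on any x, so they can stream
    # through the pipeline with no waiting.
    c = [a * b[i - 1] + b[i] for i in range(1, n)]
    for i in range(2, n):
        # Term i needs only term i-2: two interleaved independent chains.
        xs[i] = a2 * xs[i - 2] + c[i - 1]
    return xs

print(direct(3, [1, 2, 3, 4, 5], 2))
print(backed_up(3, [1, 2, 3, 4, 5], 2))
```

The price of the transformation is extra arithmetic (the combined terms), which is the usual trade-off when a dependency distance is lengthened to keep a pipeline busy.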


Once the exact form of the inner loop has
been defined, the pipelined arithmetic unit can
be programmed to execute it. Typically this
involves starting the next iteration of the inner
loop as soon as the current iteration has freed
up the first stage in the pipeline. However,
there are often "collisions" where more than one
iteration attempts to use the same stage at the
same time. When this occurs it is necessary to
insert non-compute delays into the program that
actually lengthen the time per iteration, but
have the opposite effect of bringing the total
number of iterations executed per second back up
to the theoretical maximum. Such delays are
typically implemented by leaving results in internal
registers for short periods of time.
Davidson et al. [3] have developed some good
techniques for determining optimal delay placement.
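
Collisions can be detected mechanically from a table that records in which clock cycles one iteration occupies each pipeline stage. The sketch below uses the standard reservation-table view of a pipeline, which is assumed here to be the setting of the techniques cited above; the table contents and the function names are hypothetical, and the search is restricted to starting iterations at a constant interval.

```python
def forbidden_latencies(table):
    """table[s] lists the cycles in which one iteration uses stage s."""
    forbidden = set()
    for cycles in table:
        for i in range(len(cycles)):
            for j in range(i + 1, len(cycles)):
                # Two iterations started this many cycles apart would
                # both need stage s in the same cycle.
                forbidden.add(abs(cycles[j] - cycles[i]))
    return forbidden

def smallest_constant_latency(table):
    """Smallest fixed start interval none of whose multiples is forbidden."""
    forbidden = forbidden_latencies(table)
    limit = max(forbidden, default=0)
    latency = 1
    while any(m in forbidden for m in range(latency, limit + 1, latency)):
        latency += 1
    return latency

# Hypothetical pipeline: stage 0 is used in cycles 0 and 3.
example = [[0, 3], [1], [2]]
print(forbidden_latencies(example))
print(smallest_constant_latency(example))
```

In this example iterations cannot be started 3 cycles apart, which also rules out starting one every cycle; starting one every 2 cycles is collision-free.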


Finally, the choice and implementation of 
the inner loop cannot proceed without concern for 
the uppermost level of pipelining - namely the 
overlap of data transfers between memories with 
the arithmetic computations performed in the 
arithmetic processor. Data sets must be segmented 
into pieces small enough to fit into "local store,"
but large enough so that the overhead involved in 
synchronizing the transfers with the computations 
is kept to a minimum. Additionally, the existence 
of data segmenting at all may force a complication 
to the inner loop of the calculations to allow 
carry-over of intermediate results from segment 
to segment. 
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PARALLEL PREFIX COMPUTATION 




Richard E. Ladner and Michael J. Fischer
University of Washington

Abstract - The prefix problem is to compute
all the products $x_1 \circ x_2 \circ \cdots \circ x_k$ for $1 \le k \le n$,
where $\circ$ is an associative operation. Using a
recursive construction, we obtain a product circuit
to solve the prefix problem which has depth
exactly $\lceil \log_2 n \rceil$ and size bounded by $4n$. An
application yields fast, small Boolean circuits to
simulate finite state transducers. By simulating
a sequential adder, we obtain a Boolean circuit
for n-bit binary addition which has depth
$2\lceil \log_2 n \rceil + 2$ and size bounded by $14n$. The size
can be decreased significantly by permitting the
depth to increase by an additive constant.
1. Introduction 
Let $\circ$ be an associative operation on a
domain D. The prefix problem is to compute,
for given $x_1, \ldots, x_n \in D$, each of the products
$x_1 \circ x_2 \circ \cdots \circ x_k$, $1 \le k \le n$.

By analogy with Boolean combinational 

circuits [6], [7], we consider product circuits, 
which are directed acyclic oriented graphs. Each 
node of in-degree 2 represents a product of its 
two inputs. All other nodes have in-degree 0 and 
are labelled with an integer between 1 and n. 
These are the input nodes. With each node v we
associate an element of D in the obvious way.


We consider two complexity measures on a 
product circuit N . C(N) , the size, is the 
number of product nodes in N , and D(N) , the 
depth, is the maximum number of product nodes on 
any directed path in N. For example, the circuit
of Figure 1 has depth 3, size 4 and computes
$x_1 \circ x_2 \circ x_3 \circ x_4 \circ x_5$. Note that it also computes
$x_1 \circ x_2 \circ x_3 \circ x_4$, $x_1 \circ x_2$ and $x_3 \circ x_4$.

Figure 1. A product circuit. (All arcs
are directed downwards.)


† This research was supported in part by NSF
Grant No. DCR-12997-A0O1 through a subcontract
from M.I.T. and by NSF Grant No. GJ-43264.

The depth of a circuit corresponds to the
computation time in a parallel computation
environment, whereas the size represents the amount
of hardware required. For the prefix problem, it
is straightforward to construct a circuit of
size $n-1$, the minimum possible, but its depth
is also $n-1$. Similarly, it is not difficult to
find a circuit of depth exactly $\lceil \log_2 n \rceil$, the
minimum possible, but the immediate recursive
construction yields a circuit of size $\Omega(n \log n)$.
In Section 2, we find a solution to the prefix
problem of minimum depth $\lceil \log_2 n \rceil$ and size $< 4n$.


The reader familiar with logical networks
will notice many similarities between our methods
and those used in constructing fast, linear-size
circuits for binary addition, such as the
"carry-lookahead" method [9]. Indeed, our methods,
applied to the binary addition problem, yield a
circuit which has linear size and depth
$2\lceil \log_2 n \rceil + 2$. Brent has an adder of depth
$\log_2 n + O(\sqrt{\log_2 n})$ but has size $\Omega(n \log n)$ [2].
Krapchenko has a linear size adder of depth only
$\log_2 n + O(\sqrt{\log_2 n})$ [5], [7]. It appears, however,
that our circuit is quite competitive with Brent's
and Krapchenko's circuits for small values of n.


The construction involves two steps. First, 
given an arbitrary finite state transducer, we 
obtain in Section 3 a circuit for simulating the 
machine on inputs of length n which has depth 
O(log n) and size O(n). Applying that construction
to the simple machine for binary addition of
Figure 2 and analyzing the constant carefully 
yields the desired result. The details are 
presented in Section 4. 


Figure 2. A sequential adder.


2. Circuits for the Prefix Problem


In this section, we define a family of circuits
$P_k(n)$ for solving the prefix problem on
n inputs. For each k, the depth
$D(P_k(n)) \le k + \lceil \log_2 n \rceil$. The size
$C(P_k(n)) < 2(1 + 1/2^k)n$ for all $n \ge 1$ and
$0 \le k \le \lceil \log_2 n \rceil$. For small n, the size is
substantially smaller than this bound would suggest.

The recursive construction of $P_0(n)$ is shown in
Figure 3, and the construction of $P_k(n)$ for $k \ge 1$
is shown in Figure 4. When $n = 1$, $P_k(1)$ is simply
a single input node and contains no products. In the
figures, circles represent concatenation nodes.

Figure 3. The construction of $P_0(n)$.

Figure 4. The construction of $P_k(n)$, $k \ge 1$.

That the constructions achieve the desired depth
follows easily by induction given the additional
fact, also proved by induction, that the last output
in $P_k(n)$ has depth exactly $\lceil \log_2 n \rceil$, even when
$k > 0$. The correctness of the construction is also
easily shown by induction and is left to the reader.

Let $S_k(n) = C(P_k(n))$. Then $S_k$ satisfies the
following recurrences:

$$S_0(n) = S_1(\lceil n/2 \rceil) + S_0(\lfloor n/2 \rfloor) + \lfloor n/2 \rfloor, \quad n \ge 2;$$
$$S_k(n) = S_{k-1}(n/2) + n - 1, \quad n \text{ even and } n \ge 2,\ k \ge 1;$$
$$S_k(n) = S_{k-1}(\lceil n/2 \rceil) + n - 2, \quad n \text{ odd and } n \ge 3,\ k \ge 1;$$
$$S_k(1) = 0, \quad k \ge 0.$$

In case n is a power of 2, we get exact solutions

$$S_0(n) = 4n - F(2 + \log_2 n) - 2F(3 + \log_2 n) + 1,$$
$$S_1(n) = 3n - F(1 + \log_2 n) - 2F(2 + \log_2 n),$$

and more generally,

$$S_k(n) = S_0(n/2^k) + 2(1 - 1/2^k)n - k$$
$$\phantom{S_k(n)} = 2(1 + 1/2^k)n - F(2 + \log_2 n - k) - 2F(3 + \log_2 n - k) + 1 - k$$

holds for all k, $0 \le k \le \log_2 n$. Here, $F(m)$
denotes the m-th Fibonacci number, and
$F(m) = (\phi^m - \hat{\phi}^m)/\sqrt{5}$, where $\phi = (1 + \sqrt{5})/2$ and
$\hat{\phi} = (1 - \sqrt{5})/2$ (cf. [3]). Thus, for large n and
fixed k, $S_k(n)$ is bounded by
$2(1 + 1/2^k)n - a\,n^{0.69424\ldots}$, where a is
a constant depending only on k. Some values of
$S_k(n)$ are shown in Figure 5.

    n \ k     0    1    2    3    4    5    6    7
      1       0
      2       1    1
      4       4    4    4
      8      12   11   11   11
     16      31   27   26   26   26
     32      74   62   58   57   57   57
     64     168  137  125  121  120  120  120
    128     369  295  264  252  248  247  247  247

Figure 5. $S_k(n)$ for n a small power of 2.

When n is not a power of 2, we do not
have an exact solution, but it is easily verified
by induction that $S_k(n) < 2(1 + 1/2^k)n - 2$,
$n \ge 1$. In fact we know that $P_k(n)$ is not
optimal for n not a power of two. For example,
$C(P_0(9)) = 13$, but the circuit of Figure 6 has
size only 12 since $S_1(8) = 11$, and it also has
minimal depth 4.

Figure 6. A solution to the
9-input prefix problem.
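
The recursive constructions can be exercised with a short Python sketch that evaluates the circuit and counts its product nodes at the same time. It follows the reading of Figures 3 and 4 implied by the size recurrences: for k = 0 the input is split into a left part solved with k = 1 and a right part solved with k = 0, and for k >= 1 adjacent inputs are paired, the pairs are solved with k - 1, and the remaining outputs are filled in. The function name `prefix`, the list-based representation and the one-element counter are assumptions of this example, and depth is not tracked.

```python
def prefix(items, k, op, counter):
    """Return all prefix products of items, adding the number of
    product nodes used to counter[0]."""
    n = len(items)
    if n == 1:
        return list(items)
    if k == 0:
        half = (n + 1) // 2
        left = prefix(items[:half], 1, op, counter)
        right = prefix(items[half:], 0, op, counter)
        counter[0] += len(right)
        # Extend every right-hand prefix by the last left-hand output.
        return left + [op(left[-1], r) for r in right]
    # k >= 1: combine adjacent pairs, recurse, then fill the gaps.
    pairs = [op(items[i], items[i + 1]) for i in range(0, n - 1, 2)]
    counter[0] += len(pairs)
    inner_inputs = pairs + ([items[-1]] if n % 2 else [])
    inner = prefix(inner_inputs, k - 1, op, counter)
    out = [items[0]]
    for pos in range(2, n + 1):            # 1-based output position
        if pos % 2 == 0:
            out.append(inner[pos // 2 - 1])
        elif pos == n:                     # n odd: last inner output
            out.append(inner[-1])
        else:
            counter[0] += 1
            out.append(op(inner[pos // 2 - 1], items[pos - 1]))
    return out

def circuit_size(n, k):
    counter = [0]
    prefix(list(range(n)), k, lambda a, b: a + b, counter)
    return counter[0]

print(prefix(list("abcdefghi"), 0, lambda a, b: a + b, [0]))
print(circuit_size(8, 0), circuit_size(8, 1), circuit_size(9, 0))
```

String concatenation is used in the first call because it is associative but not commutative, so an ordering mistake in the construction would show up in the output. The sizes printed for (n, k) = (8, 0), (8, 1) and (9, 0) agree with the values 12, 11 and 13 quoted in the text.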


It is an open problem to determine just how to
split the circuit to optimize the construction
using the methods of Figures 3 and 4.

There is an analogy between product circuits
and addition chains [8], [4]. Let D be the
natural numbers, $\circ$ be ordinary addition, and fix
each input to 1. Then the minimum size circuit
to compute a number n is exactly the length of
the shortest addition chain for n. A prefix
circuit on n inputs under this interpretation
constructs each of the integers from 1 to n.
Unlike most of the work on addition chains, we
are interested in the depth as well as in the
size. As with addition chains, analysis becomes
much more difficult for n not a power of 2.

3. Application to Finite State Machines

A classic example of a sequential process is
a finite state transducer (cf. Booth [1]). Given
an input of length n and an initial state we
show below how to compute in parallel the output
and final state. This method leads to the
construction of fast Boolean circuits that simulate
finite state transducers.

We use the Mealy model of finite state
transducer which is a five-tuple
$M = (Q, \Sigma, \Delta, \delta, \gamma)$ where Q is a finite
set of states, $\Sigma$ is the input alphabet, $\Delta$ is
the output alphabet, $\delta : Q \times \Sigma \to Q$ is the
transition function and $\gamma : Q \times \Sigma \to \Delta$ is the
output function.

For each input symbol a we define a
function $M_a : Q \to Q$ by $q M_a = \delta(q, a)$. (The
argument to $M_a$ is on the left.) Given an input
word $a_1 a_2 \cdots a_n$, the state
$q\, M_{a_1} \circ M_{a_2} \circ \cdots \circ M_{a_n}$ is
the state of M after reading $a_1 a_2 \cdots a_n$ starting
in state q, where $\circ$ denotes functional composition.

A parallel algorithm to compute the output
and final state given the input $a_1 a_2 \cdots a_n$ and
the initial state $q_0$ is:

1. Compute $M_{a_1}, M_{a_2}, \ldots, M_{a_n}$ in parallel.
2. Compute $N_1 = M_{a_1}$, $N_2 = M_{a_1} \circ M_{a_2}$, ...,
   $N_n = M_{a_1} \circ M_{a_2} \circ \cdots \circ M_{a_n}$ by a parallel
   prefix algorithm.
3. Compute $q_1 = q_0 N_1$, $q_2 = q_0 N_2$, ..., $q_n = q_0 N_n$
   in parallel.
4. Compute $b_1 = \gamma(q_0, a_1)$, $b_2 = \gamma(q_1, a_2)$, ...,
   $b_n = \gamma(q_{n-1}, a_n)$ in parallel.

The output is $b_1 b_2 \cdots b_n$ and the final state
is $q_n$.

Let $c_1$ ($d_1$) be the size (depth) of computing
$M_a$, $c_2$ ($d_2$) be the size (depth) of computing
functional composition, $c_3$ ($d_3$) be the size
(depth) of computing functional evaluation, and
$c_4$ ($d_4$) be the size (depth) of computing $\gamma(q, a)$.
Given an input of length n and an initial state,
the size and depth to compute the output and
final state is

$$\text{SIZE} \le c_2\, c(n) + (c_1 + c_3 + c_4)\, n$$
$$\text{DEPTH} \le d_2\, d(n) + d_1 + d_3 + d_4,$$

where $c(n)$ ($d(n)$) is the size (depth) of a
product circuit to solve the prefix problem.
(Note: we assume the state $q_0$ can be coded or
decoded at no cost.)
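
The four steps of the parallel algorithm can be traced with a small Python sketch. It is an illustration only: the prefix step uses `itertools.accumulate`, a sequential stand-in for a parallel prefix circuit, and the names `simulate`, `adder_delta` and `adder_gamma`, as well as the encoding of states as integers 0..s-1, are assumptions of this example. Each function on states is stored as a tuple whose q-th entry is the image of state q.

```python
from itertools import accumulate

def simulate(num_states, delta, gamma, q0, word):
    """Output sequence and final state of a Mealy machine on word."""
    # Step 1: one state-to-state function per input symbol.
    maps = [tuple(delta(q, a) for q in range(num_states)) for a in word]
    # Step 2: prefix compositions; the argument is on the left, so the
    # composition of f followed by g sends q to g[f[q]].
    def compose(f, g):
        return tuple(g[f[q]] for q in range(num_states))
    prefixes = list(accumulate(maps, compose))
    # Step 3: apply every prefix to the initial state.
    states = [q0] + [n_i[q0] for n_i in prefixes]
    # Step 4: outputs depend on the state before each symbol.
    outputs = [gamma(states[i], a) for i, a in enumerate(word)]
    return outputs, states[-1]

# Sequential binary adder: the state is the carry, a symbol is a bit pair.
def adder_delta(q, a):
    return 1 if a[0] + a[1] + q >= 2 else 0

def adder_gamma(q, a):
    return (a[0] + a[1] + q) % 2

# 13 + 11, least significant bit first.
word = [(1, 1), (0, 1), (1, 0), (1, 1)]
print(simulate(2, adder_delta, adder_gamma, 0, word))
```

Steps 1, 3 and 4 are independent across positions, so only step 2 contributes more than constant depth, which is where the prefix circuit of Section 2 enters the size and depth bounds.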


There are several ways of obtaining Boolean
circuits from this method. One simple way is to
represent the $M_a$'s as $s \times s$ Boolean matrices
where s is the number of states. Functional
composition is Boolean matrix multiplication and
functional evaluation is the Boolean product of a
matrix and a vector. For this representation
using the standard matrix multiplication algorithm
and the prefix circuit $P_0$ (or $P_k$ for
fixed k) we can construct a Boolean circuit
for inputs of length n with linear size and
depth $(1 + \log_2 s)\log_2 n + d$ where d is a
constant depending only on M.


4. Application to Binary Addition

Consider the finite state transducer A of
Figure 2. There are three functions $A_{00}$,
$A_{01} = A_{10}$, $A_{11}$ on states which are closed under
composition. We represent them by a pair of bits
g,p (for generate and propagate, respectively)
as shown in Figure 7. The composition table is
shown in Figure 8, and the evaluation table in
Figure 9.

    input     function
    x  y      g  p
    0  0      0  0        $g = x \wedge y$
    1  0      0  1        $p = x \oplus y$
    0  1      0  1
    1  1      1  0

Figure 7. Computation of the function
from the inputs.

    $g = g_0 \vee (g_1 \wedge p_0)$
    $p = p_1 \wedge p_0$

Figure 8. Composition table.

    $t = g \vee (s \wedge p)$

Figure 9. Evaluation table.

From Figure 7 the inputs can be represented
by the initial g,p pair, so we get the output
table shown in Figure 10.

    $z = t \oplus p$

Figure 10. Output table.

By observation we can calculate the
constants

    $c_1 = 2$    $d_1 = 1$
    $c_2 = 3$    $d_2 = 2$
    $c_3 = 2$    $d_3 = 2$
    $c_4 = 1$    $d_4 = 1$
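
The generate/propagate encoding can be checked with a short Python sketch that adds two bit vectors through a prefix computation over (g, p) pairs. This is a hedged illustration rather than the circuit itself: the function name `add_bits`, the least-significant-bit-first ordering, the assumption that the initial carry is 0 (so evaluation reduces to t = g), and the argument naming in `compose` (`low` is the function for the less significant bits, applied first) are choices made for this example, and `itertools.accumulate` again stands in for the parallel prefix circuit.

```python
from itertools import accumulate

def add_bits(x_bits, y_bits):
    """Add two equal-length bit lists (LSB first); return (sum bits, carry out)."""
    # Figure 7: one (g, p) pair per bit position.
    gp = [(x & y, x ^ y) for x, y in zip(x_bits, y_bits)]

    def compose(low, high):
        # Carry out of the combined block: generated in the high part, or
        # generated in the low part and propagated through the high part.
        return (high[0] | (low[0] & high[1]), low[1] & high[1])

    prefixes = list(accumulate(gp, compose))
    # Evaluation with initial state 0: t = g v (0 ^ p) = g.
    carries = [0] + [g for g, _ in prefixes]
    # Output: z = t xor p, with t the carry into the position.
    sum_bits = [carries[i] ^ gp[i][1] for i in range(len(gp))]
    return sum_bits, carries[-1]

def to_bits(value, width):
    return [(value >> i) & 1 for i in range(width)]

print(add_bits(to_bits(13, 4), to_bits(11, 4)))
```

The composition uses three gates arranged in two levels, the evaluation two gates in two levels, and the output one gate, which matches the constants listed above.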


The basic costs for addition are SIZE ≤ 3c(n) + 5n
and DEPTH ≤ 2d(n) + 4. There are
certain refinements that can be made.

1.  Let the input state be the constant 0. The
    evaluation table reduces to t = g. There
    is no "evaluation" so there is no need to
    compute p at the last level before step 3.
    This results in a total savings of 3n in
    size and 2 in depth, so SIZE ≤ 3c(n) + 2n
    and DEPTH ≤ 2d(n) + 2.

2.  We may obtain an n-bit adder with the state
    as an additional "carry-in" input by forming
    an (n+1)-bit adder which starts in state 0
    and uses the lowest order bits to simulate
    the incoming state. This observation leads
    to an adder of SIZE ≤ 3c(n+1) + 2n and
    DEPTH ≤ 2d(n+1) + 2.

3.  These techniques can also be used to construct
    ones-complement adders. Because of
    the "end-around" carry the input state is a
    function of the input numbers. The input
    state is computed in step 2 which makes it
    available for step 3 where it is used. In
    this case the adder has SIZE ≤ 3c(n) + 5n
    and DEPTH ≤ 2d(n) + 4.


Using the results of Section 2 and
observation 1 above, there exist Boolean circuits
to compute n-bit sums (with no carry in) of

    SIZE ≤ $(8 + 6/2^k)\,n$   and
    DEPTH ≤ $2\log_2 n + 2k + 2$

for $0 \le k \le \log_2 n$.

Notice that if we set $k = \log_2 n$ then we
obtain a circuit of SIZE ≤ 8n + 6 and
DEPTH ≤ $4\log_2 n + 2$. These bounds are similar
to those obtained for the "carry-lookahead"
adder [9]. We believe that our circuit $P_k(n)$
for $k = \log_2 n$ is essentially the same as the
"carry-lookahead" adder.

The table of Figure 11 illustrates the 
trade-offs that can be made between size and 
depth in small adders. The numbers of Figure 11 


are based on those of Figure 5 together with 
observation 1.
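The size and depth trade-off can be explored numerically. The sketch below evaluates the two bounds as they are read here, SIZE ≤ (8 + 6/2^k)n and DEPTH ≤ 2 log2 n + 2k + 2. That reading of the damaged formulas, the function name `adder_bounds` and the restriction to n a power of two are all assumptions. These are upper bounds only and need not equal the entries of Figure 11.

```python
from math import log2

def adder_bounds(n, k):
    """Upper bounds (size, depth) for an n-bit adder with no carry-in.

    Assumes n is a power of two and 0 <= k <= log2(n).
    """
    size = (8 + 6 / 2**k) * n
    depth = 2 * log2(n) + 2 * k + 2
    return size, depth
```

Setting k = log2(n) reproduces the special case quoted in the text, SIZE ≤ 8n + 6 and DEPTH ≤ 4 log2 n + 2.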


[Figure 11 tabulates DEPTH and SIZE, in paired columns, against the number of bits. Only part of the table survives: the row labels 16, 32 and 128, and one DEPTH/SIZE column pair.]

    DEPTH   SIZE
      14      49
      16     110
      18     235
      20     491
      22    1012

Figure 11. DEPTH and SIZE of small adders.


Abstract


A new positional binary number system was 
devised in an attempt to avoid carrying in addi- 
tion. It originated from the groupoid string 
formalism, previously shown to have the computa- 
tion universality of Turing machines and the 
parallel capabilities of cellular and bus auto- 
mata. It is uniquely defined by the natural 
conditions (a) a number is doubled by adding a 
copy, shifted one place left, to the original 
number, where digits are added mod 2, and (b) 
adding 1 to any number adds 1 mod 2 at precisely 
one place. Arithmetical, combinatorial, and 
parallel computational properties of the binary 
system are discussed and some properties of 
similar systems with higher radix briefly noted. 


I. Introduction 


At the 1976 International Conference on 
Parallel Processing the writer [1] developed a 
groupoid string formalism (based on earlier work 
on patterns [2]) and proved that groupoid string 
systems (a) had the computation universality of 
Turing machines, (b) corresponded in a simple 
detailed manner to one-dimensional cellular automata,
and (c) had a natural parallel computation
potentiality which led to developing bus automata
which actually realized it to a considerable
degree. The success of the groupoid string 
approach, both in effectively speeding up Turing 
machine algorithms and in recognizing many kinds 
of formal languages, led to the hope that arith- 
metic computations might also be sped up if they 
could be properly formulated as groupoid string 
processes. 

The first problem is to find a number 
system, compatible with a groupoid string view- 
point, with some kind of parallel possibility 
inherent in its "local" structure. For example, 
addition might be done everywhere along a string, 
in parallel, without having to worry about carry 
chains. A positional notation seems mandatory 
to keep string length within reasonable bounds. 
We sought a binary system based on addition mod 
2 as the underlying groupoid. Having the one 
dimensional infinite shift register in mind as
the "active tape" suggested that integers correspond
to finite strings over {0,1}, with infinite
strings of 0's understood as preceding and
following the integer. The set of positive
integers and zero, {I}, is then, in a regular
notation in which the infinite 0-strings are
omitted,

    I = Λ + 1 + 1(0+1)*1        (1.1)

Here the null symbol corresponds to 0, 1 to 1,
and the remaining positive integers correspond
biuniquely to members of the infinite set of 
strings 1(0+1)*1 in a manner to be determined. 
Word length is required to increase monotonically 
with increase in size of the integer represented, 
with doubling an integer increasing the length by
1. It turns out that forming a daughter string,
earlier used to generate patterns in the plane
[2] where the groupoid operation is ⊕ (addition
over {0,1} mod 2), corresponds exactly to adding
a left shifted copy of the original string to itself
without carrying. We take this daughter
string as representing the integer which is twice
the integer represented by the original string.
We then obtain the powers of 2 as successive rows
of Pascal's triangle taken mod 2. To avoid carry
chains it is certainly necessary that addition of
unity initiate no chain. The number must then
change in precisely one digit, i.e., the number
system is a Gray code.

These properties define the number system 
uniquely (up to direction of reading). However, 
it was so awkward to work with at first that it 
seemed doomed to remain a mere curiosity. For 
example, given an integer in this system, there 
was no easy way to find the next higher integer. 
But eventually properties of the system began to 
emerge suggesting that it could be useful in 
handling some combinatorial problems by parallel 
computation, even if arithmetic remained diffi- 
cult. It began to seem likely that a dimly per- 
ceived inversion of what was easy and what was 
hard between this number system and conventional 
ones had deep roots and could lead to interesting 
developments in computability theory generally 
and in parallel computation in particular. 

A particularly intriguing feature of this 
kind of number system is the beautiful way in 
which arithmetic properties of numbers are asso- 
ciated with symmetries of patterns generated by 
the daughter string process. A new kind of geo- 
metry of numbers, profoundly different from 
Minkowski's, which has algebraic and combinato- 
rial features and is also related to the archi- 
tecture of cellular and bus automata may emerge 
from this line of research. The possibility of 
arithmetizing the generation of geometrical 
pattern, and the emergence of patterns as pro- 
perties of classes of strings again suggest that 
this number system deserves prolonged study even 
if it should be utterly useless for numerical 
computations of conventional kinds. 

In the following sections we develop enough 
of the groupoid formalism to show how it connects 
both with algorithmic pattern generation and with 
parallel computation via cellular and bus auto- 
mata. The two naturally meet in Pascal's tri- 
angle, from which multiplication by 2 arose 
naturally. We then derive the number system for 


the integers and extend it to numbers of the 

form $N 2^{-n}$ for all integers N and n. This gives
binary approximations for all reals to any de- 
sired accuracy. These non-integers have a possi- 
bly null non-repeating part followed by an 
infinite periodic string, the number of digits in 
the period being a power of two. Some geometric 
and combinatorial aspects of the number system 
are presented, and it is shown that many pecu- 
liarities making the system awkward for conven- 
tional computations are helpful from the view- 
point of parallel operations on cellular and bus 
automata. After a discussion of ternary and 
higher bases, a tentative assessment of the 
significance of these number systems is made. 


II. Groupoid String Systems, Patterns, and Bus Automata


A groupoid (G, ∘) is a set G closed under a
binary operation ∘, i.e., given arbitrary elements
a and b of G, their combination by means of
that operation yields an element of G, say c. In
symbols

    a ∘ b = c              (2.1)
or  ∘ : G × G → G          (2.2)


A finite groupoid is conveniently specified by 
its multiplication table. 

If the symbols representing groupoid elements
are taken as an alphabet and used to generate
strings by concatenation, we can form a
daughter string $\ldots d_i d_{i+1} \ldots$ from any
string, which can be called the parent string
$\ldots p_i p_{i+1} \ldots$, by

    $d_i = p_i \circ p_{i+1}$        (2.3)

The strings can be finite, infinite to the left, 
right or both. We denote the set of all possible 
strings by GY, the finite strings by GC", the left 
and right semi-infinite strings by G” and G 
respectively, and the set of two-way infinite 
strings by G?. A groupoid string system is a 
subset of GY closed under the (unary) daughter 
operation (or relation). The descendant relation
between strings is the transitive closure of the
daughter relation, and its converse is the
ancestor or ancestral relation. A string system
is often conveniently represented as a set of
directed trees. Nodes are strings and edges are
drawn only from parent to daughter. Fig. 2.1
illustrates the system of finite binary strings
over {0,1} under addition modulo 2. Flow is to
the root, and there are two distinct trees with
roots 0 and 1.

A two-sided identity (unit, neutral element)
and/or an annihilator (zero) can always be
adjoined to any groupoid if not already present,
and each is unique. Denoting an identity by e
and an annihilator by z, they satisfy

    e ∘ a = a ∘ e = a        (2.4)
    z ∘ a = a ∘ z = z        (2.5)

and uniqueness follows from e ∘ e' = e' = e and
z ∘ z' = z = z'.

    Tree with root 0 (parents → daughter):
        11, 00        →  0
        101, 010      →  11
        000, 111      →  00
        1001, 0110    →  101
        1100, 0011    →  010
        1111, 0000    →  000
        1010, 0101    →  111

    Tree with root 1 (parents → daughter):
        01, 10        →  1
        001, 110      →  01
        100, 011      →  10
        0001, 1110    →  001
        1000, 0111    →  100
        1011, 0100    →  110
        0010, 1101    →  011

        ⊕ | 0  1
        --+------
        0 | 0  1
        1 | 1  0

Fig. 2.1 Groupoid String System Over
({0,1},⊕) Consisting of All
Finite Strings, Where Arrows
Point from Parent Strings to
Daughter Strings


The string system of Fig. 2.1

is embeddable in G*” over (z,0,1). by matching the 
strings in that system to the strings in G-” con- 
sisting of those strings preceded and followed by 
infinite "rays" of z's, and defining z as the 
daughter of any element of G. The daughter of a 
finite string with z at one end and e at the other 
has the same length as the parent. Such systems 
are ultimately periodic under iterated daughter 
transformations, as are infinite periodic strings.
The latter give rise to an infinite set of "wall- 
paper" designs [2], the former to shift-register 
sequences. With infinite strings of e's preced- 
ing and following a word whose end letters are 
not e, each generation increments the length of 
the non-e positions by unity. Omitting the semi- 
infinite e-strings gives the generalization of 
Pascal's triangle for addition, starting from 1, 
to the general groupoid starting with an arbitrary
groupoid string. Figures 2.2 and 2.3 illustrate
the process for the cyclic groups of order 2
and 3 respectively, starting from 1, giving 
Pascal's triangle modulo 2 and 3. 
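The daughter operation of (2.3) and the construction of Figs. 2.2 and 2.3 can be sketched in a few lines of Python. The names `daughter` and `pascal_rows` are illustrative only. Each row is padded with one identity element (0) on each side, which stands in for the semi-infinite e-strings.

```python
def daughter(parent, op):
    """Daughter string of (2.3): combine each letter with its right neighbour."""
    return [op(a, b) for a, b in zip(parent, parent[1:])]

def pascal_rows(modulus, count):
    """First `count` rows of Pascal's triangle mod `modulus`, starting from 1."""
    def add_mod(a, b):
        return (a + b) % modulus
    row = [1]
    rows = [row]
    for _ in range(count - 1):
        # Pad with the identity 0 on both sides, then take the daughter.
        row = daughter([0] + row + [0], add_mod)
        rows.append(row)
    return rows
```

With `modulus=2` the rows agree with Fig. 2.2, and with `modulus=3` with Fig. 2.3. Without padding, `daughter` shortens a finite string by one letter, as in Fig. 2.1.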


1 
11 
101 
1111 
10001 
110011 
1010101 
11111111 
100000001 
1100000011 
10100000101 
111100001111 
1000100010001 
11001100110011 
101010101010101 
1111111111111111 
10000000000000001 
110000000000000011 
1010000000000000101 
11110000000000001111 
100010000000000010001 
1100110000000000110011 
10101010000000001010101 
111111110000000011111111 
1000000010000000100000001 
11000000110000001100000011 
101000001010000010100000101 
1111000011110000111100001111 
10001000100010001000100010001 
110011001100110011001100110011 
1010101010101010101010101010101 
11111111111111111111111111111111 
100000000000000000000000000000001


Fig. 2.2 Pascal's Triangle Mod (2) 


Computation of daughter strings is performable
in parallel on a one-dimensional cellular
automaton (CA), essentially a "shift-register
accumulator" whose "logic" embodies groupoid
multiplication; see Fig. 2.4 (applied to Equation
(2.3)). For discussion of how the groupoid
formalism covers the general CA, thus achieving
computation universality via simulation of an
arbitrary Turing machine, see Rothstein [1]. That
paper also describes the genesis and parallel
speed-up aspect of bus automata (BA), giving
further references to both the CA literature and
the work of the writer and his former students,
C.F.R. Weiman and J.M. Moshell.

As noted, the CA can compute a complete
daughter string, in parallel, in the time needed
for one operation. To compute a parent string
from a daughter string requires a BA; see
Fig. 2.5. The BA is here one-dimensional, i.e.,
it utilizes a linear array of similar finite
state automata $A_0, A_1, A_2, \ldots, A_i, \ldots$. The
states of individual automata, $s_0, s_1, \ldots, s_i, \ldots$, are
labeled by the groupoid elements, which, in this 
case, are also the input and output alphabets. 


Ordinarily the state, input, and output alphabets

1
11 
121 
1001 
11011 
121121 
1002001 
11022011 
121212121 
1000000001 
11000000011 
121000000121 
1001000001001 
11011000011011 
121121000121121 
1002001001002001 
11022011011022011 
121212121121212121 
1000000002000000001 
11000000022000000011 
121000000212000000121 
1001000002002000001001 
11011000022022000011011 
121121000212212000121121 
1002001002001002001002001 
11022011022011022011022011 
121212121212121212121212121 
1000000000000000000000000001 
11000000000000000000000000011 
121000000000000000000000000121 
1001000000000000000000000001001 
11011000000000000000000000011011 
121121000000000000000000000121121 


Fig. 2.3 Pascal's Triangle Mod (3) 


[Fig. 2.4: diagram of a shift register with a "Logic" block.]

Fig. 2.4 Shift Register Computation
of Daughter String


Fig. 2.5 Bus Automaton Explained by
Means of a Connected Linear
Array of Automata $\{A_i\}$


are designated by distinct sets of symbols, e.g.
respectively as

    $K = \{s_0, s_1, \ldots\}$                      (2.6)
    $\Sigma = \{\sigma_0, \sigma_1, \ldots\}$       (2.7)
    $\Theta = \{\theta_0, \theta_1, \ldots\}$       (2.8)

The effect of input $\sigma_j$ on state $s_i$ is to induce a
transition to state $s_k$ and produce output $\theta_l$,
which can be written

    $s_i \sigma_j \to (s_k, \theta_l)$              (2.9)

In the present case we can both take

    $K = \Sigma = \Theta = G$                       (2.10)

and replace the resulting right side of (2.9),
namely $(d_k, d_k)$, simply by $d_k$. We then have

    $p_i \circ p_j \to d_k$                         (2.11)

where $p_i$, $p_j$, $d_k$ are elements of G replacing $s_i$,
$\sigma_j$ and $(s_k, \theta_l)$ respectively. We use the symbol ∘
for the groupoid operation in (2.11) instead of
the concatenative notation of (2.9) to stress that
each automaton $A_i$ is designed to embody the groupoid
multiplication table in its state transition
function. The arrows drawn in Fig. 2.5 signify

that (2.9) applies in the sense that the tail of 
an arrow at S5 of A, is drawn with its head at ae 
of Away if p, is the input of A> d. the output 


(a Moore machine is here chosen for definiteness, 
a similar discussion applies to Mealy machines). 
The BA concept combines the automata $A_i$ with
communication busses controlled locally by them
for calculation of a parent string in terms of a
daughter string. The arrows become bus sections.
They are hooked up as part of the setting-up condition
in which the $d_i$ string is used as input.
The $p_i$ string is now the output activated by a
continuous path (through arrows) between end
markers or the like bounding the $d_i$ string. As


semigroups are associative groupoids, this leads 
to immediate recognition of regular languages and 
thus to Turing machine speed-up in a number of 
steps one more than the number of tape turn- 
arounds [1]. 
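For the particular groupoid ({0,1}, ⊕), the dependency that a bus automaton has to resolve when recovering a parent from a daughter can be written out directly. From $d_i = p_i \oplus p_{i+1}$ it follows that $p_{i+1} = p_i \oplus d_i$, so one end letter of the parent fixes all the others. The sketch below is sequential and purely illustrative, and the name `parent_from_daughter` is an assumption. It shows the chain of dependencies, not the bus hardware that follows the chain in a single step.

```python
def parent_from_daughter(daughter_bits, first):
    """Rebuild the parent over ({0,1}, XOR) given its leftmost letter."""
    parent = [first]
    for d in daughter_bits:
        # d_i = p_i XOR p_{i+1}  implies  p_{i+1} = p_i XOR d_i
        parent.append(parent[-1] ^ d)
    return parent
```

The two choices of `first` give the two parents of a string, e.g. 1001 and 0110 for the daughter 101, as in Fig. 2.1.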


III. The Binary Parallel Number System 

We now derive the number system briefly des- 
cribed in I, beginning with a statement of the 
fundamental theorem. 

Theorem 3.1 There exists a binary number
system, unique up to reversal of reading direction,
satisfying (i) 1 designates the integer one;
(ii) the double of any integer is obtained by adding
a copy of that integer, shifted one place to
the left, to the original integer, where addition
in each place is modulo two; (iii) the number of
digits representing any integer can not exceed
the number of digits representing a larger
integer; and (iv) addition of one to any integer
changes one and only one digit.

The proof depends on a string of subsidiary 
results which will be stated and proved as theo- 
rems because of their interest both for the num- 
ber system and for string patterns. In the sequel 
integers are understood to be in this system. 

Theorem 3.2 The powers of two are given by
successive rows of Pascal's triangle mod 2 and are
palindromes.

Proof: The rule for constructing Pascal's
triangle is clearly expressible as the daughter
string algorithm for addition; for mod 2 addition
it is the doubling rule (ii) of the hypothesis of
Theorem 3.1. The symmetry of Pascal's triangle
makes the strings read the same backwards and
forwards, i.e., they are palindromes. Fig. 2.2
gives the powers of 2 from $2^0$ to $2^{32}$.

Theorem 3.3 For any integer N, the integers
$2^n N$, n = 0, 1, 2, ..., are given by the successive
rows of the trimmed Pascal triangle whose
first row is N; equivalently the product of N and
$2^n$ is calculated as in conventional binary except
that the final additions are performed mod 2.

Proof: The first half of the theorem is 
established as in the proof of Theorem 3.2; it is 
simply the doubling rule iterated. Before estab- 
lishing the second half, we refer to Fig. 3.1, 


    3·2^n                        2^n
      3    111            1           1
      6    1001           11          2
     12    11011          101         4
     24    101101         1111        8
     48    1110111        10001      16
     96    10011001       110011     32
    192    110101011      1010101    64

           11011  = 12
         × 1111   =  8
        ---------
           11011
          11011
         11011
        11011
        ---------
        10011001  = 96


Fig. 3.1 Three Times a Power of Two


illustrating the case N = 3 (proof that 3 is
uniquely 111: (a) it can have no more than three
digits because 4 is 101; (b) it must have more
than two digits because 2 = 11 is the only permissible
two digit number as 01, 10, and 00 would be
partially or totally "absorbed" in the semi-infinite
0-strings; (c) the only possibility left
is 111). To see that the method of the illustrative
multiplication example is again but another
form of the doubling rule, compare doubling and
quadrupling as illustrated by Fig. 3.2. The
number $x_n x_{n-1} \cdots$ is shown with a copy left
shifted and added to the original string to
double it, and this result is again duplicated as
shown to quadruple the number. The unshifted
member of the shifted double clearly lines up with
the shifted member of the unshifted double. They
"cancel" in mod 2 addition, leaving only the
result of adding a copy shifted two places left to
the original string as the quadruple of the original
number. The asserted result now follows by
induction.

[Fig. 3.2: the string $x_n x_{n-1} \cdots$ stacked with left-shifted copies of itself, showing doubling as multiplication by (11) and quadrupling as multiplication by (101).]

Fig. 3.2 Multiplication by a Power of Two.
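Theorem 3.3 can be checked mechanically if a parallel binary string is held as a Python integer whose binary digits are the digits of the string. That encoding and the helper names below are assumptions made for illustration. Doubling is then `x ^ (x << 1)`, and "conventional multiplication with the final additions mod 2" is a carry-less shift-and-XOR product.

```python
def double(x):
    """Doubling rule (ii): add a left-shifted copy, digit by digit mod 2."""
    return x ^ (x << 1)

def times_pow2(x, n):
    """Multiply by 2**n by iterating the doubling rule n times."""
    for _ in range(n):
        x = double(x)
    return x

def clmul(a, b):
    """Shift-and-add product with the partial products summed mod 2."""
    result = 0
    shift = 0
    while b >> shift:
        if (b >> shift) & 1:
            result ^= a << shift
        shift += 1
    return result
```

The example of Fig. 3.1 is reproduced by `clmul(0b11011, 0b1111)`, which equals `times_pow2(0b11011, 3)`, i.e. 12 × 8 = 96 as 10011001.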

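Theorem 3.4 The doubles of two numbers which
differ in only one place differ in two adjacent
places.

Proof: Let the numbers be $x_n \cdots x_i \cdots x_0$
and $x_n \cdots \bar{x}_i \cdots x_0$, where $\bar{0} = 1$ and
$\bar{1} = 0$, and call them $N$ and $\bar{N}$ respectively. We
have

$$
\begin{array}{rcccccccc}
         &         & x_n     & \cdots & x_{i+1} & x_i     & x_{i-1} & \cdots & x_0 \\
         & x_n     & x_{n-1} & \cdots & x_i     & x_{i-1} & x_{i-2} & \cdots &     \\ \hline
 2N =    & y_{n+1} & y_n     & \cdots & y_{i+1} & y_i     & y_{i-1} & \cdots & y_0
\end{array}
$$

$$
\begin{array}{rcccccccc}
         &         & x_n     & \cdots & x_{i+1}       & \bar{x}_i & x_{i-1} & \cdots & x_0 \\
         & x_n     & x_{n-1} & \cdots & \bar{x}_i     & x_{i-1}   & x_{i-2} & \cdots &     \\ \hline
 2\bar{N} = & y_{n+1} & y_n  & \cdots & \bar{y}_{i+1} & \bar{y}_i & y_{i-1} & \cdots & y_0
\end{array}
$$

where we have used the easily proved theorem for
addition mod 2 that

    $x \oplus y = z$ implies $\bar{x} \oplus y = \bar{z}$        (3.1)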

This theorem clearly generalizes by induction
to multiplication by $2^n$ as stated in the next
theorem.

Theorem 3.5 The multiples by $2^n$ of two numbers
which differ in one place differ in at most
(n+1) contiguous places, namely those corresponding
to 1's in the number $2^n$, where the rightmost 1
of $2^n$ lines up with the $x_i$ in which the two
original numbers differ.

Proof: Form the trimmed Pascal's triangle
starting from $N = \cdots x_i \cdots$ and put a bar over
$x_i$ as in the proof of Theorem 3.4. Then the bars
propagate just as the 1's do in Pascal's triangle,
for (3.1), via commutativity of ⊕, shows
that

    $x \oplus y = z$ implies $\bar{x} \oplus \bar{y} = z$        (3.2)

i.e., two bars (1's) combined give none (0). But
this is just what the theorem states.
A similar proof, which we omit, proves the 
further generalization stated in the next theorem. 
Theorem 3.6 If two integers are right justified
and comparisons of digits in the various
positions are encoded as 1 when they differ and 0
when they are the same, then the differences in
their multiples by $2^n$ are encoded as the multiple
by $2^n$ of the first encoding.
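Theorems 3.4 to 3.6 amount to the statement that doubling is linear over addition mod 2. A short hedged check follows, with strings again held as Python integers (an encoding assumed here, not taken from the paper) and illustrative helper names. The "difference encoding" of two right-justified strings is simply their bitwise XOR.

```python
def double(x):
    """Doubling rule: add a left-shifted copy, digit by digit mod 2."""
    return x ^ (x << 1)

def times_pow2(x, n):
    """Multiply by 2**n by iterating the doubling rule."""
    for _ in range(n):
        x = double(x)
    return x

def difference_code(a, b):
    """1 where the right-justified strings differ, 0 where they agree."""
    return a ^ b
```

For two strings that differ only in position i, the differences after multiplying by 2**n form the string for 2**n shifted i places, which is Theorem 3.5.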

Theorem 3.4 is needed to construct the num- 
ber system, while 3.5 and 3.6 are of interest in 
bus automata and groupoid pattern investigations.

We now prove Theorem 3.1. We already know 
that 1, 2, 3, 4 are respectively 1, 11, 111, 101, 
and that given the doubling rule, if we know the 
integers 1 through N, the even numbers from N to 
2N are determined, leaving only the odd numbers 
between N and 2N to be found. By Theorem 3.4 two 
successive evens differ in two adjacent places, so 
by (iv) of the hypothesis of Theorem 3.1 the odd 
number between them differs in precisely one of 
those two places from each of the two evens. We 
need only show that all the numbers from (N+1) to 
2N, for N a power of 2, have one more digit than 
N, and give the string for one odd integer, e.g.,
N+1, to conclude that all the representations of
the integers are uniquely determined to 2N.
Uniqueness would then follow, by induction on n,
$N = 2^n$, for all the integers; the base of the
induction having already been amply provided by 
the truth of the theorem for n=0,1,2. Note that 
5, 6, 7, 8 must all have four digits and begin and 
end in 1, just as in the argument leading to 111 
for 3. End O's are excluded; no three digit 
strings are available for 5; 6 and 8 have four 
digits by the doubling rule, so 7, which can not 
have fewer digits than 6 nor more than 8 also has 
four; 5 can have no more than four and must have 
more than three, and so exactly four. As 6 and 8 
are 1001 and 1111 respectively, 5 and 7 are 1101 
and 1011, not necessarily respectively. Note that 
5 has only the two possibilities of prefixing or 
suffixing a 1 to 4 = 101, that 3 can be regarded 
as either prefixing or suffixing a 1 to 2 = 11,
and that the two cases are mirror images. In 
accordance with the custom of forming larger num- 
bers by adjoining digits on the left we take 1101 
for 5, thus having only the choice 1011 for 7. 
The four four-digit numbers exhaust all possibili- 
ties for filling the two center places. The same 
situation recurs in the next "octave", 9 through 
16. The doubling process always converts strings 
beginning and ending in 1 into strings beginning 
and ending in 1, the interpolated odds must do 
likewise, and by Theorem 3.4 adoption of the prefix
convention for 5 forces it for 9, and by
induction for $2^n + 1$, n > 3. The 3 interior places
in the five-digit numbers use all possible 
"fillings" for the 8 integers they represent. 

The next octave of integers thus must again all 
have one more place, doubling the numbers of 
"fillings" which again exactly accommodate the 
new integers. By induction this is always the 
case, whence the representations of all the 
integers are unique up to a mirror reflection of 
the entire system. The choice is a reading direction
convention common to all number systems.

This completes the proof of Theorem 3.1, but
armed only with that result one might despair of
ever being able to count in this number system.
The only simply specified odd numbers are
$\{2^n + 1 \mid n = 0, 1, 2, \ldots\}$. One would then have to
find consecutive evens in each of their octaves 


between which each one could fit, the other of 
the two possibilities then being the right one 
for that slot. Then one would have to find 
another place where this last number might have 
fitted, plug the other possibility in that place 
and so on, until all slots are filled. However, 
observe that the arguments given to build up the 
integers from 1 to $2^n$, with the list, in order
and right justified, can be applied to the succession
of digit changes needed to count from
$2^n + 1$ to $2^{n+1}$. The only novelty is that now the
left justified reversed list is added mod 2,
digit by digit to $2^n$, where the list lines up its
left-most digits one place to the left of the
left-most digit of $2^n$. Fig. 3.3 illustrates the
process applied to 1 through 4 to obtain 5

    1   1           1101   5        1011           11001   13
    2   11     →    1001   6        1001     →     11101   14
    3   111         1011   7        1101           10101   15
    4   101         1111   8        1111           10001   16
              + 101                        + 1111

Fig. 3.3 Counting in Octaves


through 8, and the latter reversed to obtain 13 
through 16 via addition of 8. The process is 
amenable to parallel operation, e.g., in a planar 
BA, the 1's in $2^n$ complementing the digits of the
corresponding columns of the list, 0's leaving 
them unchanged (blanks are interpreted as 0's). 
The above can be viewed as an extension of the 
doubling rule, applied to a power of 2, to addition
of a power of two to any number not exceeding
it. We state it as Theorem 3.7.

Theorem 3.7 To find the sum of any integer
and a power of two not less than that integer, 
shift the reversed integer one place left from 
where it would be if left justified with the 
power of two and sum digit by digit mod 2. 
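Theorem 3.7 translates into a short routine if a parallel binary string is held as a Python integer whose binary digits are the digits of the string. The encoding and all function names below are assumptions for illustration. Row n of Pascal's triangle mod 2 supplies the string for 2**n, and the reversed string of N is placed so that its leftmost digit sits one place to the left of the leftmost digit of 2**n.

```python
def pascal_row_mod2(n):
    """String for 2**n: row n of Pascal's triangle mod 2 (Theorem 3.2)."""
    row = 1
    for _ in range(n):
        row ^= row << 1
    return row

def reverse_digits(x):
    """Reverse a string that begins and ends with 1 (so its length is kept)."""
    return int(bin(x)[2:][::-1], 2)

def add_pow2(rep_n, n):
    """Theorem 3.7: string of N + 2**n from the string of N, 1 <= N <= 2**n."""
    length = rep_n.bit_length()
    shifted = reverse_digits(rep_n) << (n + 2 - length)
    return shifted ^ pascal_row_mod2(n)

def octave_table(max_exp):
    """Strings for 1 .. 2**max_exp, built octave by octave as in Fig. 3.3."""
    reps = {1: 1}
    for n in range(max_exp):
        for small in range(1, 2**n + 1):
            reps[small + 2**n] = add_pow2(reps[small], n)
    return reps
```

`octave_table(4)` reproduces the entries of Fig. 3.3, for instance 5 as 1101 and 13 as 11001. Successive entries differ in exactly one digit.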

Applied iteratively this immediately proves
the correctness of the following algorithm to 
convert conventional binary to parallel binary. 

Theorem 3.8 To find the representation of 
an integer given in conventional binary (a) left- 
shift the smallest power of 2 indicated by it 
with respect to the left justified list of the 
remaining powers of 2; (b) add the first two 
numbers; (c) reverse the result and add to the 
next; (d) repeat a, b, c, alternating reversals 
and additions until one number remains. 
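One way to read the conversion algorithm is as repeated use of Theorem 3.7, taking the powers of 2 present in the conventional binary number from smallest to largest. The sketch below follows that reading, which is an interpretation and not a transcription of steps (a) to (d). Strings are held as Python integers and the function names are hypothetical.

```python
def pascal_row_mod2(n):
    """String for 2**n: row n of Pascal's triangle mod 2."""
    row = 1
    for _ in range(n):
        row ^= row << 1
    return row

def reverse_digits(x):
    """Reverse a string that begins and ends with 1."""
    return int(bin(x)[2:][::-1], 2)

def to_parallel(m):
    """Parallel binary string (as an int) of the non-negative integer m."""
    acc = 0          # string of the partial sum of the lower powers of 2
    n = 0
    while m >> n:
        if (m >> n) & 1:
            if acc == 0:
                acc = pascal_row_mod2(n)
            else:
                # Theorem 3.7: the partial sum is smaller than 2**n.
                length = acc.bit_length()
                acc = (reverse_digits(acc) << (n + 2 - length)) ^ pascal_row_mod2(n)
        n += 1
    return acc
```

For 13 = 1 + 4 + 8 the intermediate strings are 1, then 1101 (for 5), then 11001 (for 13), alternating a reversal with an addition as the theorem describes.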

This algorithm can be inverted to convert a 
given number to conventional binary. 

Theorem 3.9 To convert a parallel binary
number with n digits to conventional binary, first
add $2^{n-1}$, digit by digit, mod 2. If the result is
a string of n 0's the number is $2^{n-1}$. If not, add
$2^{n-1}$ again, restoring the original number, and
repeat the process with $2^{n-2}$ right justified,
writing 1 in the $2^{n-2}$ position of the sought conventional
binary number. The result of the addition
will have a string of k 0's to the right,
k ≥ 1. Cross them off and write (k-1) 0's in the
next (k-1) positions of the sought conventional
binary number. The procedure is then repeated
with the reversed shortened string of (n-k)
digits (add $2^{n-k-1}$, right justified, mod 2 in
each place, etc.), giving further digits of the
binary number sought, and so on until the string
has been reduced to a power of 2.

Proof: The procedure reverses the steps of 
the previous theorem, for what was called addi- 
tion is also subtraction mod 2. It alternately 
tests whether the given or modified string is a 
power of two and subtracts an appropriate power 
of two. 
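The inverse conversion can be sketched in the same spirit: test whether the string is a power of 2, otherwise remove the largest power present, strip the zeros on the right, reverse and repeat. This sketch accumulates the integer value and does not write out conventional binary digits one by one. It assumes a valid parallel binary string held as a Python integer, and its function names are illustrative.

```python
def pascal_row_mod2(n):
    """String for 2**n: row n of Pascal's triangle mod 2."""
    row = 1
    for _ in range(n):
        row ^= row << 1
    return row

def reverse_digits(x):
    """Reverse a string that begins and ends with 1."""
    return int(bin(x)[2:][::-1], 2)

def from_parallel(rep):
    """Integer value of a valid parallel binary string (given as an int)."""
    value = 0
    while rep:
        length = rep.bit_length()
        if rep == pascal_row_mod2(length - 1):
            # The string is itself the power of two with this many digits.
            return value + (1 << (length - 1))
        n = length - 2                      # largest power of 2 contained
        value += 1 << n
        rest = rep ^ pascal_row_mod2(n)     # "subtract" 2**n mod 2
        while rest & 1 == 0:                # cross off the 0's on the right
            rest >>= 1
        rep = reverse_digits(rest)
    return value
```

For 11001 the loop removes 8 (leaving 1101 after reversal), then 4 (leaving 1), and returns 13.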

A procedure similar to that of Theorem 3.7
works to add $2^n$ to all numbers from $2^n + 1$ to $2^{n+1}$.

Theorem 3.10 To add $2^n$ to any number from
$2^n + 1$ to $2^{n+1}$, shift the integer left one place
from where it would be if right justified with
the power of 2, and sum digit by digit mod 2.

Proof: The sums sought are the numbers from
$2^{n+1} + 1$ to $2^{n+1} + 2^n$. From the way they would be
constructed by the octave counting method it is
clear that the procedure given is equivalent to
doubling the $2^n$ part first and then adding as in
Theorem 3.7.
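With strings held as Python integers (an assumed encoding), the rule of Theorem 3.10 is a single shift and XOR. The function name is hypothetical, and the check values for 9 through 12 were obtained from the doubling rule and Theorem 3.7 rather than from a figure in the paper.

```python
def pascal_row_mod2(n):
    """String for 2**n: row n of Pascal's triangle mod 2."""
    row = 1
    for _ in range(n):
        row ^= row << 1
    return row

def add_pow2_upper(rep_m, n):
    """Theorem 3.10: string of m + 2**n for 2**n < m <= 2**(n+1)."""
    return (rep_m << 1) ^ pascal_row_mod2(n)
```

Applied with n = 2 to the strings for 5, 6, 7, 8 (1101, 1001, 1011, 1111) it gives 11111, 10111, 10011, 11011 for 9 through 12.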

We list a few useful miscellaneous results 
in the next theorem, most proved already. 

Theorem 3.11 All numbers begin with 1 and
end with 1, and have an even number of 1's if
they are even and an odd number of 1's if they
are odd. The largest number with (n+1) digits
is $2^n$; the smallest is $2^{n-1} + 1$.

Proof: The assertions needing proof are
those on the number of 1's. Let N have k 1's.
Then 2N will have (2k-2k') 1's, where k' is the
number of places in which 1's of the shifted copy
are above 1's of the original copy. Then 2N, for
arbitrary N, has 2(k-k') 1's, i.e., all even numbers
have an even number of 1's. Odd numbers
differ by one, in their number of 1's, from the
evens, by (iv) of Theorem 3.1.

The next theorem is of particular interest
as it shows that mere counting by octaves to $2^n$
also computes the set of binomial coefficients
$\binom{n}{r}$, inaugurating the study of "parallel
combinatorics".

Theorem 3.12 If the integers from 1 to $2^n$
are tabulated, either left or right justified,
then the number of digit changes (alternations)
in the rth column from either end (blanks are
counted as 0's) is precisely $\binom{n}{r}$.

Proof: The theorem is true by inspection of
Fig. 3.3 for n = 2. Assume it is true for n = k
and consider the right justified list of integers
from 1 to $2^{k+1}$. By the discussion in connection
with Fig. 3.3 which led to Theorem 3.7, the number
of digit changes in the first k columns of
the second half of the list is the same, column
by column, as those of the first half. We thus
form the total number of changes for case k+1 by
adding the changes for two contributions, offset
by one, of case k. But this is identical with
the rule by which Pascal's triangle is constructed,
whence the numbers of changes are given
by $\binom{k+1}{r}$ and the theorem is established.
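Theorem 3.12 can be checked by brute force for small n. The sketch below tabulates the strings right justified and counts the changes in each column. It starts the count from an all-blank row standing for zero, which is an assumption about how the first appearance of a 1 in a column is counted. The conversion routine is the iterated form of Theorem 3.7 and all names are illustrative.

```python
from math import comb

def pascal_row_mod2(n):
    """String for 2**n: row n of Pascal's triangle mod 2."""
    row = 1
    for _ in range(n):
        row ^= row << 1
    return row

def to_parallel(m):
    """Parallel binary string (as an int) of m, by iterating Theorem 3.7."""
    acc, n = 0, 0
    while m >> n:
        if (m >> n) & 1:
            if acc == 0:
                acc = pascal_row_mod2(n)
            else:
                reversed_acc = int(bin(acc)[2:][::-1], 2)
                acc = (reversed_acc << (n + 2 - acc.bit_length())) ^ pascal_row_mod2(n)
        n += 1
    return acc

def column_changes(n):
    """Digit changes per column (column 0 = rightmost) counting 0 .. 2**n."""
    reps = [to_parallel(m) for m in range(2**n + 1)]
    counts = [0] * (n + 1)
    for a, b in zip(reps, reps[1:]):
        diff = a ^ b
        for r in range(n + 1):
            counts[r] += (diff >> r) & 1
    return counts
```

For n = 3 the counts are 1, 3, 3, 1, and in general they match `comb(n, r)`.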


Fig. 3.4 gives a Markov algorithm for
doubling a number and illustrates its use. As
the markers move in one direction their action
can be simulated by a finite state machine, showing
once again that doubling is "immediate" on a
BA. As only one marker at most is ever present

    α0 → 0α         1101 = 5
    α1 → 1β         α1101
    β0 → 1α         1β101
    β1 → 0β         10β01
    β  → .1         101α1
    α  → .Λ         1011β
    Λ  → α          10111 = 10

Fig. 3.4 Markov Algorithm for Doubling
Fig. 3.4 Markov Algorithm for Doubling 


the only critical features of the priority order- 
ing of productions are that marker introduction 
has lowest priority, and that all conclusive pro- 
ductions, as a group, have next lowest priority. 
Markov algorithms for daughter strings over gen- 
eral groupoids are readily transcribable from 
their multiplication tables. For example, in G”, 
where daughters are one letter shorter than their 
parents, with $G = \{g_1, g_2, \ldots, g_n\}$ and


6; ° &. 


= we can write 
5 81? 


“08s > B4%4 


(3.3) 


for the algorithm which computes the daughter for 
all finite strings of lengths greater than one, 
and leaves strings of length one unchanged. 

There are (n+1) markers, but if the groupoid has


an identity e, n markers suffice as Oy can be re- 


placed by O: The productions above are multiple, 


except for the last, the first, second and third 
standing for n, $n^2$, and n different ones, the
second representing the multiplication table, 
essentially. 
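The doubling algorithm of Fig. 3.4 can be run on a small Markov-algorithm interpreter. In this sketch the markers α and β are written as the letters `a` and `b`, the empty string stands for Λ, and a rule flagged `True` is conclusive. The rule list is a reading of the figure and the interpreter itself is an illustrative assumption. Rules are tried in the order listed and the first applicable one rewrites the leftmost occurrence of its left side.

```python
# (left side, right side, conclusive?) in priority order
DOUBLING_RULES = [
    ("a0", "0a", False),
    ("a1", "1b", False),
    ("b0", "1a", False),
    ("b1", "0b", False),
    ("b", "1", True),     # marker b at the right end: emit the final 1
    ("a", "", True),      # marker a at the right end: just remove it
    ("", "a", False),     # lowest priority: introduce the marker on the left
]

def run_markov(word, rules=DOUBLING_RULES, max_steps=10_000):
    """Apply the rules until a conclusive rule fires or none applies."""
    for _ in range(max_steps):
        for lhs, rhs, conclusive in rules:
            pos = word.find(lhs)   # the empty left side matches at position 0
            if pos >= 0:
                word = word[:pos] + rhs + word[pos + len(lhs):]
                if conclusive:
                    return word
                break
        else:
            return word
    raise RuntimeError("step limit exceeded")
```

`run_markov("1101")` passes through a1101, 1b101, 10b01, 101a1, 1011b and stops at 10111, the trace shown in Fig. 3.4.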

The algorithm of Fig. 3.4 is easily modified 
to multiply by $2^n$: replace the last rule by


A > OL O10. « «Os keeping the other rules unchanged 


and also applicable to a, and Be except that the 


f 
conclusive productions for a and 8 cease to be so 
and become conclusive only for Og and Bee Simi- 


larly we can easily write a groupoid nth genera- 
tion descendant Markov algorithm. The markers 
still move in one direction, so the computation 
is still immediate on a suitable BA. This can be 
generalized to the equivalent of a Turing machine 
speed-up theorem distinct from the one cited 
earlier. 

We close this chapter with some observations 
on addition and multiplication in parallel binary.
First, the simple rules for adding sufficiently
high powers of 2 to a number or for multiplying
by a power of 2 fail in general. For
example, if one computes 3 × 3 by the rule of
Theorem 3.3, namely as in conventional binary 
except that the final additions are mod 2, the 
answer is 15 while 3 × 5 gives 27. Addition is
particularly frustrating, for given some big 
number, there still seems to be no simple way 
even to tell what digit to change in order to add 
1 to it (short of finding where it occurs in the 
list of integers and going to its successor). 
The difficulty stems from the way powers of 2 are 
folded into an integer; adding two numbers con- 
taining a common power of two still seems to 
require replacing the two contributions of that 
power by the next higher power. Carrying has 
thus not been avoided, in essence. It may well
be that Theorems 3.8 and 3.9 will have to be 
utilized in some form to do many ordinary arith- 
metical tasks, even on a BA, but conversion is 
rapid both ways and may often be necessary only 
in part. Hopefully results like Theorem 3.12 
will ultimately abound and have important 
practical applications. 
IV. Extension to Fractions 2^{-n}N

It is known that the direct product of any 
number of cyclic groups of order 2 is a group 
(hence a groupoid) satisfying the "self-solving" 
conditions 

aob =coa (4.1) 

as well as being commutative and unipotent. This 
means that in Pascal's triangle mod 2 not only 
are successive rows daughters of their immediate 
predecessors (remember that the rest of the
half-plane is covered by 0-strings) but strings
parallel to the sides, terminating on the sides
of the original triangle are parents of the
parallel strings immediately above them. As the
original sides, 111..., give 1 when doubled, we
immediately deduce that the successive
semi-infinite strings parallel to one side, starting
with that side represent the successively higher
negative powers of 2. Using (4.1), rewritten in
the notation of (2.3), permits us to write 


(4.2) 


and to interpret the resulting recursive algo- 
rithm for finding the parent string as division 
by 2. 
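As a minimal sketch of doubling and halving, assume the groupoid is {(0,1), ⊕} and that a parent string is 0 everywhere to the left of its first digit. Under those assumptions each parent digit is the running XOR of the daughter digits seen so far. The function names `double` and `halve` and the truncation parameter `width` are illustrative only and do not come from the text.

```python
def double(bits):
    """Daughter string: XOR each pair of neighbours, padding with 0 at both ends."""
    padded = [0] + list(bits) + [0]
    return [padded[i] ^ padded[i + 1] for i in range(len(padded) - 1)]


def halve(bits, width):
    """First `width` digits of a parent string whose daughter is `bits`.

    Assumption: the parent is 0 to the left of its first digit, so each
    parent digit is the running XOR of the daughter digits read so far.
    """
    out, acc = [], 0
    for j in range(width):
        acc ^= bits[j] if j < len(bits) else 0
        out.append(acc)
    return out


one = [1]
print(double(double(one)))       # row 2 of Pascal's triangle mod 2
print(halve(one, 8))             # leading digits of the string for 1/2
print(halve(halve(one, 8), 8))   # leading digits of the string for 1/4
```

Halving a finite string generally yields a semi-infinite one, which is why the sketch returns only a prefix of the requested length.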

Fig. 4.1 shows how both of the foregoing 
developments can be combined in an extended 
Pascal half-plane pattern. The horizontal arrow 
points to the vertex of the original Pascal tri- 
angle, the vertical arrow to that of the Pascal 
triangle made up of inverse powers of 2 calcula- 
ted by (4.2). The infinite half-plane to the 
left of the common side of the two triangles is 
covered with 0's, and an infinite triangle of 0's
fills the rest of the half-plane containing the 
two triangles. 

For N = 2^{n_2} 3^{n_3} 5^{n_5} ... p^{n_p} the halving
process will not give infinite strings (as usual we
suppress infinite 0-strings) until the n_2 powers
of 2 are eliminated, i.e., the new N is odd.
Eventually the process produces a trimmed triangle
from which the periodic infinite part of
2^{-n}N is read off parallel to its sides. See
Fig. 4.2, which shows the process for 9 = 11111.
The vertical arrow indicates the left end digit
of 9. The horizontal arrow indicates the 1,
which together with the 1 beginning the infinite
string of 1's of 9/2, bounds the "top", 11001,
of a trimmed Pascal triangle whose rows are 2^n 13.
The lines parallel to its bottom side give the
periodic parts of 2^{-n}9. The 9 is also the top of
a trimmed triangle (whose rows are 2^n 9). This
relation between 9 and 13 is reciprocal, for
2^{-n}13 has its periodic part given by the side
strings of 2^n 9. Every odd number is clearly
paired in this way with an odd number with the
same number of digits. Some, like 1, 3, 11 and
15 (respectively 1, 111, 10011 and 10101), pair
with themselves.

[Fig. 4.1 Positive and Negative Powers of Two. Region labels: "Half-Plane of Zeros", "Pascal to Infinity" (twice), "Zeros to Infinity".]

[Fig. 4.2 2^{-n}N for N = 9]

The connection between Pascal's triangle and
the negative powers of a binomial is old. Just
as its rows give the coefficients of (1+x)^n, so
do the infinite sequences ("sides") parallel to
one side give the coefficients occurring in the
series expansions of (1-x)^{-n}. Proofs of many
results of this chapter can be devised by using
such standard theorems with the binomial
coefficients taken mod 2.

We now designate a power of 2 by T, with
superscript (positive or negative or zero) when
needed, whether the string be finite (integer,
2^n), or infinite to the right (fraction, 2^{-n}).
We have T^0 = 1, T^1 = 11, T^n = (n+1)th row of
Pascal's triangle, T^{-1} = 111..., T^{-2} = 10101...,
and so on, with T^{-n} = (n+1)th "side" of Pascal's
triangle.

The difference between 2^n and T^n is that the
latter admits an interpretation as an operator on
all binary strings, while the former is a number
which is T^n(1). The set of operators
{T^n | n = 0, ±1, ±2, ...} is clearly
the free commutative group on one generator,
isomorphic to integer addition, and T^n(1) = 2^n.
Explicitly,

    T^0 = I                  (4.3)
    T^m T^n = T^{m+n}        (4.4)
    T^m(1) = 2^m             (4.5)

The operation T is essentially the daughter
algorithm for 2-way infinite strings, with finite
and semi-infinite strings understood as having
two or one semi-infinite string(s) of identities
in the remaining places. It thus has meaning for
an arbitrary groupoid. The interpretation of T^n
as an operator on strings over an arbitrary
groupoid is not generally the corresponding
descendant or ancestor function, encoded as a binary
word. The operator T^2, for example, encoded as
101, can be interpreted as element by element
groupoid multiplication of the operand string
with a copy of itself shifted two places left.
For the daughter function (2.3) it is natural to
take the first "factor" from the shifted string.
For the general groupoid string ...a_i a_{i+1} a_{i+2}...
the "grand-daughter" element is
(a_i ∘ a_{i+1}) ∘ (a_{i+1} ∘ a_{i+2}), not (a_i ∘ a_{i+2}) as
in {(0,1), ⊕} and as called for by T^2.

One might question the need for introducing
T-notation; except in the case at hand, where it
appears to be only multiplication by a power of
2, it has limited applicability, and so little
interest. However, the notation makes it clear
that substituting T^m for the 1's of T^n for all
n, i.e., writing T^m for the 1's of Pascal's
triangle generalized to the half-plane, transforms
that half-plane into itself by a "translation" of
m along the "row coordinate". The interpretation
of the (T^m, 0)-strings as appropriately shifted
⊕-addition of as many copies as there are
occurrences of T^m in the string verifies their
equivalence to the 2^{n+m} rows of the Pascal half-plane.
But now, for any interpretation of the 1's of
Pascal's triangle as T^m's, for fixed positive or
negative m, we can carry through the entire chain
of reasoning by which the number system was
constructed. We can therefore take the entire list
of positive integers, replace the 1's throughout
by any one T^m, perform the indicated shifted
⊕-additions, and come out with the list of integer
multiples of T^m(1). In short, we obtain 2^m N
whether m be negative or positive.

We sum up the main results obtained in this 
chapter by stating them as theorems. 

Theorem 4.1 The negative powers of 2 are
right semi-infinite binary strings lying parallel
to one side of Pascal's triangle and beginning
with an element of the other side. The side
itself represents 2^{-1} and successive contiguous
parallel strings represent successively higher
powers, the nth being 2^{-n}. When Pascal's
triangle is augmented with the strings produced
successively by the parent string algorithm, with
the initial 1's of the set of ancestor strings
continuing the alignment of the side of the
original triangle containing the initial 1's of the
positive powers of 2, then a half-plane is
covered by three regions: (a) the original Pascal
triangle, (b) a triangle of 0's, (c) a Pascal
triangle whose left side is parallel to the rows
of the original triangle, and whose right side is
a continuation of the left side of the original
triangle.

Theorem 4.2 The string representing 2^m N
for any positive integer N, and for m any
positive or negative integer or zero, is obtained by
multiplying the strings representing 2^m and N as
in ordinary binary except that the final
additions are mod 2.
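The multiplication rule of Theorem 4.2 can be sketched for finite strings as a carry-less (XOR) product. The sketch below reads each string as a bit pattern and covers only the finite case, that is m ≥ 0; the helper names `clmul` and `times` are illustrative and not taken from the text.

```python
def clmul(a: int, b: int) -> int:
    """Binary long multiplication in which partial products are XORed, not added."""
    result = 0
    shift = 0
    while b >> shift:
        if (b >> shift) & 1:
            result ^= a << shift
        shift += 1
    return result


def times(x: str, y: str) -> str:
    """Multiply two finite digit strings with the final additions taken mod 2."""
    return format(clmul(int(x, 2), int(y, 2)), "b")


print(times("11", "11"))      # string for 2 times string for 2
print(times("111", "111"))    # the 3 x 3 example from the end of chapter III
```

Because reversing both factors reverses the product, the result does not depend on which end of a string is taken as the leading digit.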


V. Related Number Systems 

In this chapter we summarize our initial 
results on two kinds of number system, or better, 
two compatible methods for constructing number 
systems related to the parallel binary system 
discussed heretofore, and make some observations 
about possibilities for more general systems. 


The first is the natural generalization from 


base 2 to any base. The second, based on the 
observation that one penalty for requiring both 
monotonic increase in number of digits with 
increasing integer size and the grey code pro- 
perty was loss of the neat multiplication algo- 
rithm of powers of 2, requires instead that 
multiplication of any two integers be done the 
same way as it is done for the base. There may
be many other properties, the presence of which 
would partially or completely define a complete 
unambiguous representation of the number system. 
One must proceed very cautiously, however, for it
is easy to demand properties which are, in fact, 
incompatible. We close with some remarks on 
generalization of the underlying groupoid from 
cyclic groups to general groups, semigroups, 
quasigroups, and general groupoids. 

From Fig. 3.4 and the Markov algorithm (3.3)
it is easy to see that the daughter string algo- 
rithm is easily adapted to the design of a base 
k number system. It multiplies a string by k. 
The rows of Pascal's triangle mod k represent the 
powers of k, and inverse powers, together with 
the positive powers, fill out a half-plane as 
1
i ee ae 
10010000020 
11011000022 
12112100021 
10020010020. 
Ty LO: 2 2 Ode e@ <2? 22 
Dee te 2! de 2 2 ad 2 
100000000000 
1100 — 
i’ oes ae ere ee 


Fig. 5.1 3^{±n} in Base 3


before. Fig. 2.3 gives that part of Pascal's
triangle representing 3^n for 0 < n < 33 and
Fig. 5.1 gives a portion of the half-plane
representing 3^{±n}. Note that the triangle above
the triangle of 0's has 1212... for one of its
sides; only in base 2 does it coincide with the
original Pascal triangle. In base k that side
becomes 1(k-1)1(k-1).... We have proved
uniqueness of the system of integers in base 3,
assuming monotonicity and the grey code property 
as before. The argument is much more laborious 
than for binary (omitted for lack of space) 
because the rule for multiplication by 3 leaves 
two empty slots between the triples of consecu- 
tive integers. Uniqueness was proved by showing 
that only one way of building the system "from 
the ground up" avoided inconsistencies later on. 
In the resulting system negative numbers "mirror"
the positives in a pleasing fashion. Using the
build-up process "backwards" through 0 gives, for
the negative of abc... the string ā b̄ c̄ ...,
where barred elements are inverses of unbarred
ones: 0̄ = 0, 1̄ = 2, 2̄ = 1. Fig. 5.2 gives the
Markov process for multiplication by the base k 
and a part of the number system in base 3. While 
there is little doubt that parallel systems in 
any base can be constructed, our experience in 
proving uniqueness for base 3 and the large num- 
ber of possibilities for base 4 led to low prior- 
ity for attempting to prove uniqueness for all 
bases, particularly in view of the possibility of 
non-uniqueness suggested by the second class of 
number systems, to which we now turn. 
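Multiplication by the base can be sketched by taking the daughter string with addition mod k in place of ⊕. The sketch assumes the underlying groupoid is the cyclic group of order k, and it uses the base 3 strings for 2, 6 and 18 as they appear in Fig. 5.2(b); the function names `times_base` and `negate` are illustrative.

```python
def times_base(digits, k):
    """Daughter string over the cyclic group of order k: sum neighbours mod k."""
    padded = [0] + list(digits) + [0]
    return [(padded[i] + padded[i + 1]) % k for i in range(len(padded) - 1)]


def negate(digits, k):
    """Replace every digit by its additive inverse mod k (the "barred" digit)."""
    return [(-d) % k for d in digits]


power = [1]
for n in range(4):
    print(n, "".join(map(str, power)))   # rows of Pascal's triangle mod 3
    power = times_base(power, 3)

print(times_base([2, 1], 3))   # string for 2 multiplied by 3
print(negate([2, 0, 1], 3))    # digit-wise inverse of the string for 6
```

With k = 2 the same function reduces to the binary doubling rule, since addition mod 2 is ⊕.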

As the multiplicative properties of the 
powers of 2 are simple (as in standard binary, 
except that the final sums are performed mod 2), 
but complicated if neither factor in a product of 
two factors is a power of 2, we examined the con- 
sequences of requiring that multiplication always 
be performable in that fashion. It turns out that
"gaps" and inversions appear (inversion here means
having the smaller of two integers with a larger
number of digits). The gaps, naturally, are
filled by assigning primes to them
in some fashion, and there seems to be nothing yet
visible to restrict how one should assign strings
to primes ("in order" no doubt, but what is that
order?). Unique factorization exists as does a
simple algorithm to test divisibility, and all of 
the foregoing apparently applies to any base. 

To take a specific case, take 1 for 1, and 11
for 2 as before, getting the same strings as


(a) [general base k productions not legible; the base 3 instance appears in (b)]

(b) V = {0,1,2}
    {N} = Λ + V + (1+2)V

    α0 → 0α      β0 → 1α      γ0 → 2α
    α1 → 1β      β1 → 2β      γ1 → 0β
    α2 → 2γ      β2 → 0γ      γ2 → 1γ
    α  → ·Λ      β  → ·1      γ  → ·2
    Λ  → α

     1  1        15  2101
     2  21       16  2111
     3  11       17  2011
     4  211      18  2211
     5  221      19  1211
     6  201      20  1011
     7  101      21  1111
     8  111      22  1121
     9  121      23  1021
    10  2121     24  1221
    11  2221     25  1201
    12  2021     26  1101
    13  2001     27  1001
    14  2201


Fig. 5.2 (a) Markov process: multiplication
             by k in base k
         (b) Base 3


before for 2^n. We can take 111 for 3 and thus
get the same expressions for 2^n 3 as before. But
now 9 must be 10101; compare it with 8, which is 
still 1111, and we see the grey code property is 
gone. If we choose the previous number for 5, 
1101, then 15 is now 100011, whereas 16 is 10001, 
and the monotonicity property is also gone. Our 
investigation to date suggests that these infeli- 
cities are invariable companions of the simpli- 
fied multiplication rule. There is consolation 
in that division is easy: the recursive rela- 
tions involved in computing reciprocals are 
easily solved, and dividing by n is the same as 
multiplying by 1/n. To illustrate we compute
1/3, 3 = 111. Let the string sought be
a_1 a_2 a_3 .... Then we must have

    a_1 = 1
    a_1 + a_2 = 0
    a_i + a_{i+1} + a_{i+2} = 0    (i ≥ 1)

with sums taken mod 2,
from which we easily find that 1/3 is 110110110....
This string is periodic; the 2-way infinite 
string is one of many studied earlier in connec- 
tion with patterns [2]. Indeed, if a daughter 
element be written below and between the two 
parent elements of which it is the product so 
that all three are at the vertices of an equi- 
lateral triangle, then in the periodically 
covered part of the plane the 1's are on the 
vertices of a network of hexagons with 0's at 
their centers (Fig. 5.3). Our research has thus 
come full circle: study of groupoid string
patterns [2] has led to "number systems" which may
well become powerful tools for investigating 


1101101101 
1011011011 

... 1101101101 ...
011011011 
1101101101 


Fig. 5.3 Periodic Covering of the 
Plane by 110110... 


aspects of pattern, symmetry, and geometric 
configuration. 
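The reciprocal computation for 1/3 can be sketched as a power-series inversion with coefficients mod 2: each new digit is chosen so that the carry-less product with the divisor is 1 followed by 0's. The function name `reciprocal` and the truncation parameter `width` are illustrative, and the sketch assumes the divisor string begins with 1.

```python
def reciprocal(divisor, width):
    """First `width` digits a_1 a_2 ... with divisor * a = 1000... (sums mod 2).

    Assumption: divisor[0] == 1, so every digit is determined by earlier ones.
    """
    out = []
    for n in range(width):
        acc = 1 if n == 0 else 0          # target digit of the product
        for k in range(1, min(n, len(divisor) - 1) + 1):
            acc ^= divisor[k] & out[n - k]
        out.append(acc)
    return out


print("".join(map(str, reciprocal([1, 1, 1], 9))))   # leading digits of 1/3
print("".join(map(str, reciprocal([1, 1], 9))))      # leading digits of 1/2
```

For the divisor 111 the recursion reduces to each digit being the XOR of the two before it, which is what produces the period of three.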

The last possibility for generalization con- 
sidered here is that of admitting more general 
groupoids than cyclic groups. If associativity, 
unique solvability, or commutativity are sacri- 
ficed one might think that all chances of build- 
ing a number system have been lost. Many might 
think that even abandoning cyclic groups (single 
generating element) for more general groups would 
have similar effects. However, we have shown that 
many different quasigroups (groupoids with unique
two-sided solvability) can generate precisely the
same totality of patterns in the plane. A single 
pattern-equivalent class can contain associative, 
non-associative, commutative, and non-commutative 
quasigroups, and direct products of as many dif- 
ferent ones as desired. Considered as algebraic 
systems, they are often strongly non-isomorphic. 
It would be rash, therefore, to deny the possibi- 
lity that such systems might be useful for paral- 
lel computation. Semigroups, too, in spite of the 
fact that inverses of elements usually can not be 
defined, should not be ruled out either, both be- 
cause of the intimate connection between them and 
finite state machines and because at least one 
cyclic group is associated with every idempotent, 
be it an identity or an annihilator. 


VI. Concluding Remarks


Although a truly parallel arithmetic has not 
yet been constructed, those found may have pattern 
and combinatorics so firmly woven into their very 
structure that they may permit new attacks to be 
mounted on some very difficult problems. They 
have intriguing "nesting" possibilities, like the
substitution of T^m for the 1's of Pascal's
triangle, and a totally new mix of "local" and
"global". They approach additive and multiplica- 
tive properties of numbers more from the multipli- 
cative side, in contrast to traditional mathema- 
tics (e.g., Peano's axioms). We therefore feel 
they have great interest and much promise, despite 
the profound problems they present. 
RESPONSE TIME OF PARALLEL PROGRAMS

Abstract -- The response time of a parallel
program is defined to be the maximum delay between
successive activities of an event. Response times
are dependent on two factors: the parallel
program's structure and the program's scheduler
policies. It is shown that under weak assumptions
about the scheduler policy, and with the imposition
of an N-fair policy in which each event gets a chance
to execute at least every N scheduler steps, the
response time becomes dependent only on program
structure: either the response time is infinite
or it is linear in N (i.e., ≤ cN for some c > 0).
Also presented are decision procedures for
determining whether or not the response time is
infinite and for determining the exact linear
relationship in N (i.e., the minimum c).


1.0 Introduction


The response time of a parallel program is 
the maximum time that an event in the program may 
ever wait for a chance to execute. Response time 
is clearly important in realtime programs: large 
or unbounded response time may cause the program 
to fail. Even nonrealtime programs may be ser- 
iously degraded if the response time is too large. 

Response time of a parallel program is not 
easily computed. Often it is only determined by 
empirical observation. The fundamental question 
addressed in this paper is: 


How can one compute the response time 
of a parallel program? 


Previous studies of this question [1,4,8] have 
shown that, under certain assumptions about how 
programs are scheduled, one can show that parti- 
cular events execute infinitely often. While 
this type of information is useful, there are sit- 
uations where it is inadequate: e.g., in a real- 
time program for data acquisition, knowing that 
an event will eventually execute does not guaran- 
tee that data (for example) will not be lost. 

In order to get a more useful analysis of 
the response time of a parallel program we will 
make stronger assumptions about how our parallel 
programs get scheduled. In all of the previous 
work very weak scheduling assumptions have been 
made. Here we will assume instead that we have 
scheduling where each event of the parallel pro- 
gram gets a chance to try to execute at least 
every N scheduling steps (N>0), in the worst 
case. To avoid scheduling anomalies, it is nec- 
essary that N be at least as large as the number 
of events in the program. Even in this case, of 
course, some events may get their chances faster 
than others; however, no event ever waits longer 
than N steps to be "looked" at by the scheduler. 

Clearly, an event may not be able to execute
every N steps; it may have to wait for some other
event to occur. In particular, if r is the
response time of some event, then r > N is possible.
A basic question is then: 


As a function of N, what values can r take? 


For example, can r be N squared, i.e., can r grow
non-linearly in N? The answers to these questions 
are contained in the response time theorem: as N 
grows, either 


(1) the response time of an event becomes 
infinite (i.e., in the worst case it 
can wait forever), or 

(2) the response time is linear in N (i.e.,
it is bounded by cN, for some constant c).


To fully describe the response time behavior
of a parallel program we must also consider
how one can compute the smallest such c
(our main theorem gives an upper bound) after
having determined that the response time is
finite. The answers to these questions are given
by providing decision procedures for the
following questions:


(1) given an event e of a parallel program, 
can e ever have infinite response time? 

(2) given a constant c > 0, is the response
time of e ≤ cN for all N?


These questions are reduced to questions about 
suitably encoded vector addition systems [10]. 

The remainder of this paper has the following 
organization. In section 2 we give a formal model 
of parallel programs and their computations and 
show how this model relates to the parallel pro- 
gramming notations found in the literature. The 
scheduler of a parallel program is presented in 
section 3. A scheduler is shown to be just an 
alternative characterization of a program's com- 
putations. In section 4 we introduce the sched- 
uler restrictions necessary for our main result. 
The response time definition and the response 
time theorem are given in section 5. In section 
6 this result is shown to include as special cases 
several schedulers used in actual systems. In 
section 7 the afore-mentioned decision procedures 
are presented. 


2.0 Parallel Programs


A parallel program P is a finite directed
graph G, a distinguished node q_1 of G, and edges
which are labelled with elements from a finite
set E. Intuitively, the nodes of G are the states
of P. If

    q_i --e--> q_j

is an edge, then we have the following semantics:
"If P is in state q_i, then event e can
occur, resulting in P going into state q_j."

Clearly, so far, P is nothing more than a finite 
state diagram. As an aside, while our main theo- 


rem depends on the assumption of finite states, 
our basic definitions and indeed several of our 
results can be generalized to allow infinite 
state parallel programs. 

Formally, a parallel program P is a 4-tuple
P = (Q, E, q_1, t) where:

(1) Q is a finite set of states, denoted
Q = {q_1, q_2, ..., q_n}.

(2) E is a finite set of events, denoted
E = {1, 2, ..., m}.

(3) q_1 is a distinguished start state.

(4) t is the state transition function:
t: Q x E ---> Q (t may be undefined for some arguments).


It should be noted that this definition is 
by no means novel. It links well with path ex- 
pressions [1] and many other such definitions. 
Also, note that we have deliberately defined a 
parallel program to be a rather unstructured ob- 
ject. The usual notions of process, semaphores, 
instruction counters and so forth, are implicit 
rather than explicit. 

As an example of a parallel program, consider 
the following directed graph, which we will call 
example 1: 


[Figure: state graph of example 1, with edges labelled by events 1 to 4]


This corresponds to a parallel program represented 
in the semaphore notation of [5] as follows: 


semaphore s (initially 1) 


parbegin 
repeat 1: P(s); 2:V(s) forever 
repeat 3: P(s); 4:V(s) forever; 
parend; 


Indeed, our model of parallel programs is capable
of representing the control aspects of any
parallel program which uses bounded value semaphores.


2.1 Parallel Program Computations 


In order to study the response time of para- 
llel programs, it is necessary to introduce the 
notion of an event blocking. Thus our definition 
of a parallel program's computations must include 
both event execution and event blocking. To this 
end, let the elements of E be called event
executions. Then the elements of E' = {e' | e ∈ E} are
called event blockings. The elements of EA = E ∪ E'
are called event activities.

We will define a parallel program's computa- 
tions to be certain finite and infinite strings 
over EA. Intuitively, an event e may execute 
whenever the program's control is in a state q 
where e is eligible to execute (i.e., t(q,e) is 
defined). An event e may block whenever the pro- 
gram is in a state where e cannot execute, the 
program has passed through a state where e could 
have executed but didn't, and e hasn't executed 
in the meantime. To formally define those 
strings over EA which satisfy this intuitive no- 
tion, we introduce the following function on EA*: 


Definition: The function state: EA* ---> Q is
defined as follows:

(1) state(Λ) = q_1

(2) For e ∈ E and x ∈ EA*,
state(xe) = t(state(x),e)

(3) For e' ∈ E' and x ∈ EA*,
state(xe') = state(x) only if
(a) t(state(x),e) is undefined, and
(b) for some event f, x = yfz such that
t(state(y),e) is defined and not
substr(fz,e),
where substr is the usual substring
predicate.


Note that the ways in which state can be un- 
defined correspond to illegal event executions and 
blockings. For example, if P is in state q and 
t(q,e) is undefined, then e is not eligible to 
execute. Likewise, if t(q,e) is defined, then e 
can't block. We are now ready to define the
computations of a parallel program.


Definition: The computations of a parallel pro- 
gram P are members of the set C, the union of the 
following two sets: 


(1) CF = {x ∈ EA* | state(x) is defined}.

(2) CI = {x an infinite string over EA |
state(y) is defined for all finite
prefixes y of x}.


The set CF is called the set of finite
computations of P and CI the infinite computations.
Note that CI may be empty and that C is closed
under finite prefix.

For later use, we distinguish a (possibly 
empty) subset of the finite computations: 


Definition: A finite computation x ∈ CF is called
terminating if for all events e of P both
state(xe) and state(xe') are undefined. A computation
terminates when no further event activity is
possible.


The following are examples of legal and il- 
legal computations, in terms of regular expres- 
sions, for the parallel program presented above. 


Legal
(1) (12 + 34)* - no event ever blocks
(2) 13'2(12)* - event 3 remains blocked forever
(3) 123(1')* - event 1 is forever blocking

Illegal
(1) (2' + 4')+ - events 2 and 4 may never block
(2) 12341' - event 1 is ineligible to block
since it can execute.


Note that example 1 has no terminating
computations.
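The state function can be sketched for example 1 as follows. The transition table below is an assumption read off the semaphore program (the state names q1, q2, q3 are hypothetical), and the set `missed` is an illustrative device for condition (3b): it holds the events that were eligible in some earlier state and have not executed since. A return value of `None` stands for "undefined".

```python
# Assumed encoding of example 1; state names are hypothetical.
T = {("q1", 1): "q2", ("q2", 2): "q1", ("q1", 3): "q3", ("q3", 4): "q1"}
START = "q1"


def parse(text):
    """Turn a string such as "13'2" into [(1, False), (3, True), (2, False)]."""
    acts, i = [], 0
    while i < len(text):
        blocking = i + 1 < len(text) and text[i + 1] == "'"
        acts.append((int(text[i]), blocking))
        i += 2 if blocking else 1
    return acts


def state(text):
    """Return the control state after `text`, or None if `text` is illegal."""
    q, missed = START, set()
    for e, blocking in parse(text):
        eligible = {ev for (s, ev) in T if s == q}
        if blocking:
            # e' is legal only if e cannot execute now but could have earlier.
            if e in eligible or e not in missed:
                return None
            nxt = q
        else:
            if e not in eligible:
                return None
            nxt = T[(q, e)]
        missed |= eligible
        if not blocking:
            missed.discard(e)
        q = nxt
    return q


for w in ["1212", "13'2", "1231'1'", "2'", "12341'"]:
    print(w, state(w))
```

The five strings are finite prefixes or instances of the legal and illegal examples listed above.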


2.2 Parallel Program Total State 


At any point during the execution of a para- 
llel program P a (possibly empty) subset of the 
events will be blocked. We define the total 
state of the program to be the state of P's con- 
trol coupled with the subset of currently blocked 
events. 


Definition: A total state of a parallel program
is a member of the set T = {(q,B) | q ∈ Q and B ⊆ E}.


Note that T is finite for finite state
parallel programs. Given any computation we can
compute the total state via the following function:


Definition: The total state function
tstate: C ---> T is defined as:

(1) tstate(Λ) = (q_1, ∅)

(2) Let xf ∈ C, B ⊆ E, q ∈ Q and tstate(x) = (q,B).
(a) If f = e then tstate(xf) = (t(q,e), B - {e}).
(b) If f = e' then tstate(xf) = (q, B ∪ {e}).
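A minimal sketch of the total state function for example 1 follows. As before, the transition table and the state names q1, q2, q3 are assumptions read off the semaphore program, and the input is assumed to be a legal computation (no legality check is made here).

```python
# Assumed encoding of example 1; state names are hypothetical.
T = {("q1", 1): "q2", ("q2", 2): "q1", ("q1", 3): "q3", ("q3", 4): "q1"}
START = "q1"


def parse(text):
    """Turn a string such as "13'2" into [(1, False), (3, True), (2, False)]."""
    acts, i = [], 0
    while i < len(text):
        blocking = i + 1 < len(text) and text[i + 1] == "'"
        acts.append((int(text[i]), blocking))
        i += 2 if blocking else 1
    return acts


def tstate(text):
    """Control state paired with the set of currently blocked events."""
    q, blocked = START, frozenset()
    for e, blocking in parse(text):
        if blocking:
            blocked = blocked | {e}                    # case (b): e joins B
        else:
            q, blocked = T[(q, e)], blocked - {e}      # case (a): e leaves B
    return q, blocked


print(tstate("13'2"))
print(tstate("1231'"))
```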


3.0 Parallel Program Schedulers


We have defined the computations of a
parallel program P to be sequences of event executions
and blockings. Which particular computation is
produced by the execution of P is determined by
the decisions made in an agent entirely external
to P; namely, by the scheduler. The scheduler
maintains a data structure that contains
information such as the state in which P's control lies
and the blocking status of P's events. We will
call this data structure the scheduler state. A
scheduler step consists of the scheduler
determining which events are eligible for event
activity, using a scheduling policy to determine
which one of those events will execute or block,
and then reflecting this decision by appropriate
changes to the data structure (i.e., making a
scheduler state transition). The scheduler
repeats this cycle as long as there are events
eligible for event activity.

In this section, we will formally define the
scheduler of a parallel program independently of
any scheduling policies. We show that this is
just an equivalent characterization of a parallel
program's computations. Thus, in subsequent
sections when scheduling policies are introduced, we
will be effectively restricting the computations
that parallel programs produce.


3.1 Scheduler State


Let P be a parallel program having n states
and m events. The scheduler state will consist
of three types of information:


(1) The program state.

(2) For each event, a delay which indicates
the number of scheduler steps which
have passed since the event's last
activity.

(3) An event status set which indicates
whether or not an event is eligible to
block.

Accordingly, we have the following formal defini- 
tion: 


Definition: Let P be a parallel program having m
events. A scheduler state S is an element of the
set SS = Q x D x B where:


(1) Q is the state set of P.
(2) D = NN x NN x ... x NN (m times),
where NN = {0,1,2,...}.
(3) B = {0,1} x {0,1} x ... x {0,1} (m times).


The ith member of D indicates the delay of the ith
event and the ith member of B indicates the
blocking status of the ith event, with 0 indicating
ineligibility.

In order to facilitate future presentation
we now introduce several projection functions on
scheduler states. Let S = (q; d_1, d_2, ..., d_m; b_1, b_2,
..., b_m) be an arbitrary element of SS. We have


(1) pstate: SS ---> Q by pstate(S) = q.

(2) delay: SS x E ---> NN by delay(S,i) = d_i.

(3) blocked: SS x E ---> {true, false} by
blocked(S,i) = {if b_i = 1 then true else false}.

(4) blockedset: SS ---> 2(E) by blockedset(S) =
{i | blocked(S,i)} where 2(E) is the power
set of E.

(5) totalstate: SS ---> T by totalstate(S) =
(q, blockedset(S)).

3.2 Schedules and the Scheduler 


A schedule for a parallel program P will be 
the non-empty sequence of scheduler states that 
correspond to a particular computation of P and 
the scheduler of P will be all schedules. We 
will show that P's scheduler is isomorphic to P's 
computation set. 


Definition: Let P be a parallel program which
has m events. Let Z = S_1, S_2, ... be a finite or
infinite sequence of scheduler states. Then Z is a
schedule for P if and only if


(1) S_1 = (q_1; 0,0,...,0; 0,0,...,0) (2m zeroes)

(2) For i ≥ 1, let S_i = (q; d_1,...,d_m; b_1,...,b_m)
and S_{i+1} = (q'; d_1',...,d_m'; b_1',...,b_m').
Exactly one of the following two cases
must hold:


(a) There is an event e in P such that

(i) t(q,e) = q'.
(ii) d_j' = {if j=e then 0 else d_j+1}.
(iii) b_j' = {if j=e then 0 else {if
t(q,j) is defined then 1 else
b_j}}.

In this case we say e executes and
denote by S_i R(e) S_{i+1}.


(b) There is an event e in P such that

(i) t(q,e) is undefined and q' = q.
(ii) d_j' = {if j=e then 0 else d_j+1}.
(iii) b_j' = {if j=e then 1 else b_j}.

In this case we say e blocks and
denote by S_i R(e') S_{i+1}.


Definition: Let P be a parallel program. Then
the scheduler for P is the set S = {Z a sequence of
scheduler states | Z is a schedule for P}.


Theorem: Let P be a parallel program. Then the 
set of P's computations C is isomorphic to P's 
scheduler S. 


Proof: We only sketch the proof. Define the
function makesch: C ---> S as follows:


(1) makesch(Λ) = (q_1; 0,0,...,0; 0,0,...,0)

(2) For xf ∈ C where f ∈ EA and makesch(x) = S,
makesch(xf) = S' such that S R(f) S'.


It should be clear that makesch is well-defined,
one-to-one, and onto.


Notation: We let makecomp denote the inverse 
function of makesch. 


As with computations, we will talk of finite, 
infinite, and terminating schedules.


4.0 Inittal Scheduler Policy 


In this section we introduce three scheduling
policies (the first two are common in the
literature, the third is new) which allow us to
develop our concept of response time.


4.1 The Busy Wait Free Policy


Recall that in example 1 we had z=123(1')* 
as a legal computation in which event 1 is forever
blocking. Although the program is technically
executing, it is essentially doing nothing.
This phenomenon has been dubbed busy wait [5] 
and great care has been taken to avoid it in the 
design of operating systems [3,6,9]. Hence, our 
first scheduling policy will be a "busy wait free" 
policy. 


Definition: Let Z=S_1 S_2 ... be a schedule for a
parallel program P. Z is called busy wait free
if for all i≥1, S_i R(e') S_{i+1} implies not
blocked(S_i,e).

Intuitively, under the busy wait free policy
once an event e blocks e may not block again until
e has executed at least once. This rules out
123(1')* as a computation but 1231'(43)* is still
legal. The busy wait free scheduler for P is then


Definition: The set SF={Z∈S | Z is busy wait free}
is called the busy wait free scheduler for P.


and the allowable computations under the busy 
wait free policy are 


Definition: The members of the set
CF={makecomp(Z) | Z∈SF} are called the busy wait
free computations of P.


The following result is immediate from the 
definitions of busy wait free schedules and com- 
putations. 


Lemma 1: Let w∈C. Then w∈CF iff for all events
e and decompositions w=xe'ye'z, we have
substr(y,e).

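
Lemma 1 can be read as a simple test on a finite prefix of a computation. The following is a minimal Python sketch of that test, not part of the original paper. It assumes a computation is given as a list of tokens, where "1" stands for an execution of event 1 and "1'" for a blocking of event 1; the function name is_busy_wait_free is hypothetical. Infinite computations such as 123(1')* can only be checked on finite prefixes this way.

```python
def is_busy_wait_free(w):
    """Check the Lemma 1 condition on a finite computation prefix.

    w is a list of tokens: "e" means event e executes, "e'" means e blocks.
    The condition fails if some event blocks twice with no execution of
    that same event in between (the decomposition w = x e' y e' z with
    e not occurring in y).
    """
    pending = set()  # events that have blocked and not executed since
    for token in w:
        if token.endswith("'"):
            event = token[:-1]
            if event in pending:
                return False  # second block without an execution in between
            pending.add(event)
        else:
            pending.discard(token)  # an execution clears the pending block
    return True


# Prefix of 123(1')*: event 1 blocks twice in a row.
print(is_busy_wait_free(["1", "2", "3", "1'", "1'"]))
# Prefix of 1231'(43)*: event 1 blocks once and never blocks again.
print(is_busy_wait_free(["1", "2", "3", "1'", "4", "3", "4", "3"]))
```

The single pass keeps only the set of events with an outstanding block, which matches the intuition in the text that an event may not block again until it has executed.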
4.2 The Release Policy 


As noted above, even with the busy wait free 
policy we have z=1231'(43)* as a legal computa- 
tion for example 1. In z event 1 blocks but is 
never released (i.e., it never executes again 
even though it is capable of doing so). This is, 
in general, unacceptable. For example, event 1 
could represent a data recording process and we 
would want it to eventually be executed if it has 
data to record. Satisfying this criterion has 
been called showing that an event executes
"infinitely often" (if it is capable of doing so)
[1,4,11]. Necessary for showing that an event
executes infinitely often is the imposition of a 
"release" scheduling policy. 

Definition: Let Z=S_1 S_2 ... be a busy wait free
schedule for a parallel program P. Z is called a
release schedule if for all i≥1 and arbitrary
distinct events e and f, we have S_i R(e) S_{i+1},
not blocked(S_i,e), and blocked(S_i,f) imply
t(pstate(S_i),f) is undefined.


Intuitively, under the release scheduling 
policy when there is a choice between executing 
either blocked or non-blocked events a blocked 
event is chosen. Thus 1231'(43)* is ruled out as 
a computation for example 1 since for the second 
and subsequent executions of event 3 the blocked 
event 1 could have been executed. We now have 


Definition: The set SFR={Z∈SF | Z is a release
schedule} is called the release scheduler for P.


and the allowable computations under the release 
scheduling policy are: 


Definition: The members of the set
CFR={makecomp(Z) | Z∈SFR} are called the release
computations of P.


The following result is immediate from the
definitions of release schedules and computations:


Lemma 2: Let w∈CF. Then w∈CFR iff for all
distinct events e and f and decompositions w=xey, we
have xfy∈CF and blocked(x,f) imply blocked(x,e).


Here, blocked(x,e) is the expected predicate on
tstate(x).


4.3 The N-Fair Policy


Under the release scheduling policy we still 
have z=(123(43)*4)* as a legal computation for 
example 1. In z event 1 executes infinitely often 
but from any execution of 1 to its subsequent exe- 
cution an arbitrary number of scheduler steps may 
pass. In certain applications this would be in- 
tolerable. To remedy this situation we introduce 
an "N-fair" scheduling policy. 


Definition: Let Z=S_1 S_2 ... be a release schedule
for a parallel program P. Let N be a fixed
integer ≥ 1. Z is called an N-fair schedule if for
all i≥1 and all e∈E, not blocked(S_i,e) implies
delay(S_i,e)≤N.


Intuitively, under the N-fair scheduling pol- 
icy events which are not blocked will undergo 
event activity (execute or block) within N sched- 
uler steps from the point of their last execution. 
Of course, blocked events may have to wait longer 
than N scheduler steps or forever, depending on 
the structures of the particular program. Thus 
in z event 1 would wait for at most N/2 execu- 
tions of event 3 since the N-fair policy would 
force the scheduler to consider event 1 at that 
time. We now have the following definitions: 


Definition: For fixed N≥1, the set SN={Z∈SFR | Z is
an N-fair schedule} is called the N-fair
scheduler for P.


and the allowable computations under the N-fair 
scheduling policy are: 


Definition: For fixed N≥1, the members of the
set CN={makecomp(Z) | Z∈SN} are called the N-fair
computations of P.


The following results are immediate from the 
definitions of N-fair schedules and computations: 


Lemma 3: Fix N≥1 and let w∈CFR. Then w∈CN iff
for all events e∈E and decompositions w=xyz with
|y|>N, not substr(y,e), and not substr(y,e'), we
have blocked(x,e).


Here |y| denotes the length of the string y. 


Lemma 4: Fix M>N≥1. Then CN ⊆ CM.


Before proceeding, we present a lemma which
will be crucial in proving our response time re- 
sults. 


Lemma 5: For a fixed N≥1, let x∈CN with the
following properties:


(1) x=yz with |z|=M>N.
(2) tstate(y)=tstate(yz).


Then for all i≥1, x_i∈CM where x_i=yzz...z (i copies
of z).


Proof: Note that by the determinism of t we have
tstate(x)=tstate(x_i) for all i≥1. For i=1, x_1=
x∈CM by Lemma 4. Fix i>1.

(1) If x_i is not in C, then we contradict
x being in C.

(2) If x_i is not in CF, then we contradict
Lemma 1.

(3) If x_i is not in CFR, then we contradict
Lemma 2.

(4) Suppose that event e is the reason why
x_i is not in CM. There are three
subcases:

(a) If substr(z,e), then we contradict
Lemma 3.

(b) If not substr(z,e) and blocked(x,e)
then we contradict Lemma 3.

(c) If not substr(z,e) and not blocked
(x,e), then we contradict x being
in CN.


4.3.1 Implementation Considerations


Implementing the busy wait free and release
scheduling policies is a rather trivial task:
the decisions to be made in a scheduler step can
be determined entirely from the scheduler state
independently of past or future decisions (i.e.,
the scheduler would be a Markov process). Note,
however, that this is not true when an N-fair
policy is in effect. When making a decision on
event activity the scheduler must consider not
only past decisions (i.e., event delays) but also
the structure of the parallel program under
consideration since a faulty decision might make
violation of the N-fair policy inevitable. Thus,
some degree of "lookahead" must be done. While
this can always be done for finite state parallel
programs, there will be some infinite state
programs which require infinite lookahead and thus
N-fair scheduling becomes impossible.

As can readily be seen in example 1, low
values of N can severely restrict the scheduler.
For example, under 2-fair scheduling there are
only four computations: 12, 13', 34, and 31'.
Since each computation is non-terminating, we
have an anomalous situation. In general, we
should choose N at least as large as the number
of events in the program.


5.0 Response Time of Parallel Programs


Recall in the computation z=(123(43)*4)*
for example 1, under N-fair scheduling once event
1 executes it will wait at most N scheduler steps
to execute again. We call this time of
waiting the response time of an event. We will be
concerned with the worst case response time of an
event for all possible schedules since a parallel
program with acceptable worst case behavior is
acceptable in general.

In most applications we would like all events
to have finite response times. Moreover, we would
like these finite response times to be "acceptable"
in some sense. Suppose we have a two
event parallel program P in which it is known that 
both events, say e and f, have finite response 
time for all values of N. Suppose further that 
event e has acceptable response time r(e) for N=t 
but f's response time is unacceptable for N<5t.
Hence, we must adopt a 5t-fair scheduling policy 
to have any hope that both events will have ac- 
ceptable response time. A basic question is: 
is event e's response time affected by this in- 
crease in N? In this section we answer the ques- 
tion by showing that e's response time will in- 
crease only linearly in N. 

We have the following definitions:

Definition: Let e be any event of a parallel
program P and for N≥1, let Z=S_1 S_2 ... be in SN. The
response time of e in Z, denoted r(e,N,Z), is
defined by two cases:


case 1: Z is a terminating schedule with S_n the
final scheduler state and blocked(S_n,e). Then
r(e,N,Z) is infinity.

case 2: Otherwise, r(e,N,Z)=max{delay(S_i,e) | i≥1}.


The N-response time of e, denoted r(e,N), is
r(e,N)=max{r(e,N,Z) | Z∈SN}.


Hence, there are two ways that r(e,N) might
be infinite: the program could terminate with e
blocked, or e might block and never execute again
in spite of the fact that the program never
terminates. This latter condition has been defined
as "individual starvation" [7].


The following result is immediate from the 
definitions: 


Lemma 6: Let e be any event of a parallel
program P and for N≥1, let Z=S_1 S_2 ... be in SN. If
r(e,N,Z) > N then e is blocked in Z for r(e,N,Z)
consecutive scheduler steps.


5.1 Response Time Theorem 


We first prove the following lemma, which 
holds for general string systems. 


Lemma 7: Let A be a finite set, w∈A*, and N≥2.
If |w|≥2|A|N+2, then there exists an a∈A such
that w can be decomposed as w=xayaz with |y|>N.


Proof: (by induction on |A|) If |A|=1 then
|w|≥2N+2. The form of w must be w=aya where
|y|≥2N>N.


Assume the result holds for |A|<k, for fixed
k≥2. If |A|=k then |w|≥2kN+2>2N+2. Assume a
is the first character of w and decompose w as
w=axy where |x|=N+1. We have two cases:


(1) If substr(y,a) we are done.

(2) If not substr(y,a) then y is in
(A-{a})* and |A-{a}|=k-1. We have
|a|+|x|+|y|≥2kN+2. Thus
|y|≥2kN+2-1-N-1=2kN-N. Since
2-N≤0, we have |y|≥2kN-N+2-N=
2(k-1)N+2. By the induction hypothesis
on y, there is b∈A-{a} such that y
can be decomposed as y=x'by'bz' and
|y'|>N.
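
Lemma 7 is a pigeonhole-style statement, so for small alphabets it can be checked exhaustively. The sketch below is an illustration added here, not part of the original proof; the function names has_wide_repeat and lemma7_holds are hypothetical. It enumerates every string of length exactly 2|A|N+2, which suffices because any longer string has such a string as a prefix.

```python
from itertools import product


def has_wide_repeat(w, n):
    """True if some symbol a admits w = x a y a z with |y| > n.

    The widest gap for a symbol lies between its first and last
    occurrence, so only those two positions need to be compared.
    """
    first, last = {}, {}
    for i, a in enumerate(w):
        first.setdefault(a, i)
        last[a] = i
    return any(last[a] - first[a] - 1 > n for a in first)


def lemma7_holds(alphabet, n):
    """Exhaustively check Lemma 7 for strings of length 2|A|n + 2."""
    length = 2 * len(alphabet) * n + 2
    return all(has_wide_repeat(w, n)
               for w in product(alphabet, repeat=length))


print(lemma7_holds("a", 2))
print(lemma7_holds("ab", 2))
print(lemma7_holds("ab", 3))
```

The search grows as |A| to the power 2|A|N+2, so this is only practical for very small cases; it is meant as a sanity check on the statement rather than a substitute for the induction.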


We are now ready to prove our main result. 


Response Time Theorem: Let P be a parallel
program having n states and m events. For any event
e of P either:


(1) There exists N≥2 such that for all M≥N
r(e,M) is infinity, or
(2) For all N≥2 there is a constant c>0
such that r(e,N)<cN.


Proof: Assume that r(e,N) is finite for all N≥2.
Suppose there is an M≥2 such that for all
constants c>0 we have r(e,M)>cM.


Let d=n2^m=|T| be the number of total states
of P and look at the constant c'=2d+1. There
must be a schedule Z in SM such that r(e,M,Z)>c'M.


Let Z=S_1 S_2 ... and let Y=totalstate(S_1)
totalstate(S_2) ... .


By Lemma 6 we can decompose Y as Y=X_1 X_2 X_3
where |X_2|=c'M and e is always blocked in X_2.
Thus, |X_2|=c'M=(2d+1)M=2dM+M≥2dM+2. Applying
Lemma 7 to X_2, there is a total state X such that
X_2=X_4 X X_5 X X_6 and |X_5|>M. Clearly, e is
blocked in X.


Rewriting Y, we have Y=X_1 X_4 X X_5 X X_6 X_3 as the
sequences of P's total states which correspond to
the schedule Z.


Let L1=|X_1 X_4 X| - 1≥0 and L2=|X_5 X| - 1>M.


Look at the computation corresponding to the
schedule Z: z=makecomp(Z). We can decompose z
as z=z_1 z_2 z_3 where |z_1|=L1 and |z_2|=L2.


By the above arguments we have tstate(z_1)=
tstate(z_1 z_2), blocked(z_1,e), and e doesn't execute
in z_2 (i.e., not substr(z_2,e)). Also, z_1 z_2∈CM
and |z_2|=L2>M. Hence, by Lemma 5, z_1 z_2 z_2 z_2 ... is
in CL2 and it follows that r(e,L2) is infinity.
Thus, with this contradiction, r(e,M)<cM for all
M≥2.


As an interesting sidelight of this proof
we have established an upper bound on c to be
c'=2d+1, that is, 1 + n2^(m+1).

6.0 Additional Scheduler Policies


We have defined only the minimum amount of 
scheduler policies needed to prove the response 
time theorem. Observe that the N-fair scheduler 
will make an arbitrary choice when more than one 
blocked event is capable of executing. Because
of this, it is possible that an event may have an
infinite response time even though it is always
capable of executing. A way to avoid this is by
a FIFO scheduling policy, as has been suggested
in [6,9]. We have


Definition: Let P be a parallel program and for
fixed N≥2, let Z=S_1 S_2 ... be in SN. Z is called
an N-fair FIFO schedule if for all i≥1 and
arbitrary distinct events e and f, the following
holds: S_i R(e) S_{i+1}, blocked(S_i,f), and
t(pstate(S_i),f) defined imply
delay(S_i,e)≥delay(S_i,f).


In this definition note that the release
policy guarantees blocked(S_i,e). Intuitively, in


FIFO scheduling when there is a choice of execu- 
ting several blocked events, an event which has 
been blocked for a maximum number of scheduler 
steps is chosen. Since the FIFO policy is a 
restriction of N-fair scheduling, we have the 
following corollary: 


Corollary 1: The response time theorem holds un- 
der an N-fair FIFO scheduling policy. 


In certain applications it is desirable that 
the choice among blocked events be made on the 
importance of the events rather than on the egal- 
itarian FIFO rule. This is called priority sched- 


uling [3]. Each event is given a priority as 
follows: 

Definition: Let P be a parallel program. A
priority function is a total mapping θ:E ---> NN.


When several blocked events are capable of execu- 
ting, the choice is made on the basis of maximum 
priority: 


Definition: Let P be a parallel program with
priority function θ. For fixed N≥2, let Z=S_1 S_2 ...
be in SN. Z is called an N-fair θ priority
schedule if for all i≥1 and arbitrary distinct events
e and f, the following holds: S_i R(e) S_{i+1},
blocked(S_i,f), and t(pstate(S_i),f) defined imply
θ(e)≥θ(f).
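
The FIFO and priority definitions differ only in the quantity that is maximized among the blocked events that are capable of executing. As a hedged illustration (not taken from the paper), the selection step might be sketched as follows. The names pick_fifo and pick_priority, and the representation of delays and priorities as dictionaries, are assumptions made for this example; ties are broken by event name only to keep the output deterministic.

```python
def candidates(blocked, enabled):
    """Blocked events whose transition is defined in the current state."""
    return sorted(e for e in blocked if e in enabled)


def pick_fifo(blocked, enabled, delay):
    """FIFO rule: choose a candidate with maximum delay."""
    cands = candidates(blocked, enabled)
    return max(cands, key=lambda e: delay[e]) if cands else None


def pick_priority(blocked, enabled, theta):
    """Priority rule: choose a candidate with maximum priority theta."""
    cands = candidates(blocked, enabled)
    return max(cands, key=lambda e: theta[e]) if cands else None


blocked = {"1", "3"}
enabled = {"1", "3", "4"}
delay = {"1": 5, "3": 2, "4": 9}
theta = {"1": 0, "3": 7, "4": 9}
print(pick_fifo(blocked, enabled, delay))      # event 1 has waited longest
print(pick_priority(blocked, enabled, theta))  # event 3 has higher priority
```

Event 4 is never selected here because it is not blocked; both rules only arbitrate among blocked events, which is the choice the plain N-fair scheduler leaves arbitrary.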


Note, unlike FIFO scheduling, it is possible 
under priority scheduling for an event to have in- 
finite response time even though it is always 
capable of executing. Since priority scheduling 
is a restriction of the N-fair policy, we have 


Corollary 2: The response time theorem holds un- 
der an N-fair priority scheduling policy. 


7.0 Response Time Decision Procedures


There are two additional questions we must 
answer in order to completely describe the res- 
ponse time behavior of a parallel program: 


(1) Given an event e of a parallel program
P, is the response time of e ever
infinity?

(2) What is the minimum constant c>0 which
describes the linear growth of r(e,N)
as N grows?


We will answer these questions by providing deci- 
sion procedures for the following: 


(a) Does there exist an N≥2 such that r(e,N)
is infinite?

(b) Given c>0 is r(e,N)<cN for all N≥2?


The answer to question (1) follows directly from 
(a). Question (2) is answered by (b) and the 
observation in section 5 that the minimum
constant is bounded above by 1 + n2^(m+1).


The decision procedures for (a) and (b) will 
be by reduction to questions about suitably en- 
coded vector addition systems. 


7.1 Vector Addition Systems 


In this section we briefly review the defi- 
nition of vector addition systems [10], their 
decision procedures that we will use, and relate 
these systems to our definition of a parallel 
program's scheduler. 


Definition: A vector addition system of degree k,
denoted VAS, is a 2-tuple W=(v,V) where:


(1) The start vector v∈NN^k=NN x NN x ... x NN
(k times).
(2) V is a finite set of vectors, each in
ZZ x ZZ x ... x ZZ (k times) where
ZZ={...,-2,-1,0,1,2,...}.


Definition: The reachability set of a VAS W,
denoted R(W), is a subset of NN^k recursively defined
as follows:


(1) v∈R(W).
(2) for x∈R(W) and w∈V, x+w∈R(W) iff x+w≥0.
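
To make the recursive definition concrete, the following minimal Python sketch enumerates the part of R(W) reachable within a bounded number of vector additions. It is an illustration only: the function name reachable and the step bound max_steps are assumptions introduced here, and the sketch is not a decision procedure for either of the problems discussed in this section, since R(W) may be infinite.

```python
from collections import deque


def reachable(start, vectors, max_steps):
    """Vectors of R(W) reachable from start in at most max_steps additions.

    A sum x + w is admitted only if every coordinate stays >= 0,
    following clause (2) of the definition.
    """
    start = tuple(start)
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        x, depth = frontier.popleft()
        if depth == max_steps:
            continue
        for w in vectors:
            y = tuple(a + b for a, b in zip(x, w))
            if min(y) >= 0 and y not in seen:
                seen.add(y)
                frontier.append((y, depth + 1))
    return seen


# A two-coordinate system that moves a single token back and forth.
print(sorted(reachable((1, 0), [(-1, 1), (1, -1)], 5)))
# An unbounded counter, truncated after three steps.
print(sorted(reachable((0,), [(1,)], 3)))
```

The first system has a finite reachability set, so the bound is never the limiting factor; the second shows why a bound is needed when a coordinate can grow without limit.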


We will be using the following two problems 
which are concerned with the reachability set of 
a VAS. 


Definition: The boundedness problem: given an
arbitrary x≥0 is there a y∈R(W) such that y≥x?


Definition: The reachability problem: given an
arbitrary x≥0, is x∈R(W)?


A decision procedure for the boundedness problem 
can be found in [10]. The decidability of the 
reachability problem has recently been claimed in 
[13]. 

A link between a parallel program's schedu- 
ler and vector addition systems is that a VAS can 
represent any finite state control activity [10]. 


7.2 A High Level VAS Language 


Rather than work with vectors of integers, 
it will be more convenient and convincing to give 
the VAS reductions in terms of a "high-level" 
nondeterministic VAS language. This approach has 
been previously used in [12]. 

There are five statement types in the lan- 
guage: initialization of variables, assignment, 
nondeterministic branch, testing the finite 
state control, and updating the finite state con- 
trol. All but the first statement types may have 
a statement label. The syntax and semantics are 
as follows: 

Initialization

    var v_1=a_1, v_2=a_2, ..., v_n=a_n

The distinct variables v_1,v_2,...,v_n are
initialized to the respective natural numbers
a_1,a_2,...,a_n. Variables not initialized start
at zero.

Assignment

    v_1 <--- v_1 + c_1, ..., v_n <--- v_n + c_n

where the v's are distinct variables and the c's
are integers. The assignment can take place only
if v_i + c_i≥0 for all i. Otherwise, the VAS
computation terminates.


Guessing

    guess(s_1,s_2,...,s_n)

This statement causes a nondeterministic branch
to one of the statements labelled s_1,s_2,...,s_n.
If n=1 then the branch is deterministic.


Testing Event Activity
event (character) 


This statement is used to see which events are 
eligible for event activity. It returns a list 
of eligible events, each prefixed by the supplied 
character. If no event activity is possible 
(i.e., the parallel program has terminated), then 
the list consists exclusively of the supplied 
character. For example, if events 1 and 2 can 
execute and event 3 can block, then event(s) re- 
turns sl,s2,s3. If no event activity can take 
place, the list would be s. Event is always used 
in conjunction with the guess statement, e.g., 
guess (event(s)). 


Testing for a Blocked Event 
blocked(e,s) 


This statement causes a branch to Statement s if 
event e is blocked. If e isn't blocked then the 
statement acts like a no-op. 


Updating the Control 
update (f) 


Here f is either an event execution (e) or an 
event blocking (e'). This statement reflects in 
the finite state control the result of event acti- 
vity f. 


Globally, VAS programs are listed one state- 
ment per line and execution commences at the first 
statement. Execution proceeds sequentially until 
a guess is encountered, whence several nondeter- 
ministic computations may be spawned. A computa- 
tion may terminate in the ways listed above or by 
executing the last statement in the list (when it 
is not a guess statement). Although not listed 
above, we also have a no-op statement with the
obvious semantics.


7.3 VAS Reductions 


We now show that the questions posed above 
are reducible to questions about suitable VAS sys- 
tems. 


Theorem 2: Let P be a parallel program having n 
states and m events. For an event e of P the 
question of whether or not there is an N22 such 
that r(e,N) is infinity is reducible to a bounded- 
ness question. 


Proof: The vector addition system will have the
following coordinates:


<finite state control, local control,
d_1,...,d_m, M_1,...,M_m, B>


To facilitate the presentation of the VAS program 
we will employ obvious abbreviations described in 
comments and the following two macros: 


zerodelay(<1>,<2>)
    l: d<1> <--- d<1> - 1, M<1> <--- M<1> + 1
    guess(l,<2>)


This macro takes two string inputs and does the 
usual concatenation. Its function is to try to 
set the delay counting variable d<l> to zero. 


sucdelay(<1>)
    blocked(<1>,l)
    d<1> <--- d<1> + 1, M<1> <--- M<1> - 1
    l: no-op


The purpose of this macro is to increase the de- 
lay count variable d<l> of a nonblocked event. 


The VAS program is as follows: 


    var M_1=1, M_2=1, ..., M_m=1

comment: guess N

10: M_1 <--- M_1 + 1, ..., M_m <--- M_m + 1
    guess(10,20)


comment: simulate P - guess to start phase 3
whenever e blocks

20: guess(event(2))
2: guess(20)

comment: the following group of statements is
repeated for 1≤i≤m.

2i: zerodelay(i,2ii)
2ii: update(i)

comment: the following statement is repeated for
each j not equal to i.

    sucdelay(j)
    guess(20)

comment: the following group of statements is
repeated for each i not equal to e.

2i': zerodelay(i,2ii')
2ii': update(i')

comment: the following statement is repeated for
each j not equal to i.

    sucdelay(j)
    guess(20)

2e': zerodelay(e,2ee')
2ee': update(e')

comment: the following statement is repeated for
each j not equal to e.

    sucdelay(j)
    guess(20,30)


comment: simulate P assuming that e never
executes again.

30: B <--- B + 1
    guess(event(3))
3: guess(30)

comment: as in phase 2, the following group of
statements is repeated. However, here it is
repeated for each i not equal to e.

3i: zerodelay(i,3ii)
3ii: update(i)

comment: the following statement is repeated for
each j not equal to i.

    sucdelay(j)
    guess(30)

comment: the following group of statements is
repeated for each i not equal to e.

3i': zerodelay(i,3ii')
3ii': update(i')

comment: the following statement is repeated for
each j not equal to i.

    sucdelay(j)
    guess(30)

comment: by busy wait free, 3e' is impossible.

3e: guess(3e)


The result follows since r(e,N) is always finite 
iff. B is bounded. 


Several comments are in order about this VAS 
program. In phase 1, by nondeterminism, every 
value of N22 is considered. In phase 2, the N- 
fair execution of the parallel program is simu- 
lated. An event activity as dictated by the 
finite state control is nondeterministically 
chosen and appropriate event delay counts are
performed on the d variables. The variable pairs
d_i and M_i play a crucial role in that they force
only N-fair computations to be considered. Note
that d_i + M_i=N is invariant. When an event is
executed or blocked we try to set d_i to zero.
The crucial part of the simulation is to observe
that even if d_i isn't set exactly to zero (it
will be in some computation) we still get N-fair
computations since M-fair computations are N-fair
computations for M<N. Similarly, d_i is increased
by 1 whenever event i is not blocked and another
event activity takes place.
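
The role of the pair d_i, M_i can be illustrated with a small Python model of the two macros. This is a sketch under stated assumptions, not the paper's construction: the class name DelayCounter is hypothetical, the nondeterministic guess inside zerodelay is replaced by an explicit rounds argument, and a return value of False stands for "the assignment cannot take place, so the VAS computation terminates".

```python
class DelayCounter:
    """Models one pair (d, M) with the invariant d + M = N."""

    def __init__(self, n):
        self.d = 0  # delay count of the event
        self.m = n  # remaining budget; starts at N

    def sucdelay(self):
        """d <- d + 1, M <- M - 1; not possible once M has reached 0."""
        if self.m == 0:
            return False  # delay would exceed N: this run is cut off
        self.d += 1
        self.m -= 1
        return True

    def zerodelay(self, rounds):
        """Run the loop body d <- d - 1, M <- M + 1 'rounds' times."""
        for _ in range(rounds):
            if self.d == 0:
                return False  # d - 1 would be negative
            self.d -= 1
            self.m += 1
        return True


c = DelayCounter(3)
print([c.sucdelay() for _ in range(4)])  # the fourth increment is refused
print(c.d + c.m)                         # the invariant d + M = N
print(c.zerodelay(3), c.d)               # guessing the right count zeroes d
```

Since no coordinate may go negative, any run in which a nonblocked event's delay would exceed N simply stops, which is how the simulation restricts attention to N-fair computations.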

Phase 3 is started nondeterministically when- 
ever event e blocks in phase 2. The purpose of 
phase 3 is to assume e will never execute again 
and reflect this in variable B. If e does exe- 
cute, then phase 3 loops forever and B is bounded. 
If the parallel program terminates (i.e., no
event activity with e blocked) or e is never
executed again, then B grows unboundedly.


For the second VAS reduction we will simply 
modify the above VAS program. 


Theorem 3: Given c>0, whether or not r(e,N)<cN 
for all N22 is reducible to a reachability ques- 
tion. 


Proof: A variable D is added to the above VAS
program with an initial value of D=c+1. Statement
10 is changed to


10: M_1 <--- M_1 + 1, ..., M_m <--- M_m + 1, D <--- D + c


Hence, after phase 1 completes D has a value of
cN + 1.


A fourth phase is added at the end of the
program as follows:


40: D <--- D - 1, B <--- B - 1
    guess(40)


The purpose of the fourth phase is to see if B
ever is ≥ D. If so, then there must be some VAS
computation in which B=D and thus r(e,N)>cN.
This happens only when D=B=0 can be reached. It
remains only to make changes to the VAS program
to nondeterministically start phase four. They
are:


(1) Change statement 2 to 2: guess(40). 
(2) Change each guess(30) in phase 3 to 
guess (30,40). 


Hence, r(e,N)>cN iff. D=B=0 is reached. 
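
The claim that D equals cN+1 when phase 1 completes follows from a short count, sketched below in Python as an illustration. The function name phase_one and the iterations parameter (the number of times the modified statement 10 is executed before the guess moves on to statement 20) are assumptions introduced for this example.

```python
def phase_one(c, iterations):
    """Replay the modified statement 10 the given number of times.

    Each M_i starts at 1 and D starts at c + 1; every execution of
    statement 10 adds 1 to each M_i and c to D. Returns (N, D), where
    N is the common final value of the M_i.
    """
    m_i, d = 1, c + 1
    for _ in range(iterations):
        m_i += 1
        d += c
    return m_i, d


for c in (1, 2, 5):
    for k in (1, 2, 3):
        n, d = phase_one(c, k)
        print(c, n, d, d == c * n + 1)
```

After k executions N=1+k and D=(c+1)+ck=c(1+k)+1=cN+1. Statement 10 runs at least once before its guess is reached, which matches the restriction to N≥2.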

8.0 Conclusions


We have introduced the notion of an N-fair 
scheduling policy as a condition which allows the 
development of theoretical results on the response 
time behavior of parallel programs. We have 
shown that for any event either the response time 
is infinite or it is linear in the choice of N, 
that one can determine which is the case, and 
that one can compute the exact linear relation- 
ship in the finite case. 

Although the methods used would seem to in- 
dicate that computing the exact response time be- 
havior of a parallel program is an intractable 
task, the development of heuristics for computing
good upper bounds on response time is under
investigation.


Algorithmic Analysis Of Control Structure Behavior


by 


R.M. Mattheyses and S.E. Conry* 


Clarkson College, Potsdam, NY 


Summary 


It is frequently convenient to view an asyn- 
chronous digital system involving parallel pro- 
cessing as being comprised of two parts: a con- 
trol structure and a device structure. In this 
paper we are primarily concerned with analysis of 
the control structure for such a system. 


The control structures studied in this paper 
are modular in nature. Their primitive elements 
are control modules which behave in much the same 
manner as many which have been previously proposed 
[1-4]. We have incorporated these modular control 
structures in a model for parallel processing sys- 
tems and obtained necessary and sufficient condi- 
tions under which the control structure of such a 
system is well behaved. (Similar problems have 
been investigated by others [2,5].) When presen- 
ted with a control structure of significant size, 
one immediately asks whether or not that control 
structure is well behaved. In this paper, three 
algorithms for determining whether or not a con- 
trol structure is well behaved are presented and 
analyzed. 


The first algorithm discussed is based di- 
rectly on the necessary and sufficient conditions 
for "good behavior" that have been obtained. The 
two other algorithms are based on a more genera- 
tive approach to the analysis of control struc- 
tures. We define what is effectively a reduction 
system for control structures and show that a 
control structure is certainly well behaved if it 
reduces to a singleton node using the reduction 
rules given. This approach to the analysis of 
control structures is very similar in flavor to 
the analysis of control flow graphs for sequential 
programs presented in [6,7]. 


Some well behaved control structures exhibit 
peculiar structural configurations which cannot be 
analyzed directly using the reduction rules given. 
In these cases, some transformation of the control 
structure is necessary if reduction to a singleton 
is to be achieved. Two algorithms based on the 
"reduction rules" approach to control structure 
analysis are presented and compared. Each incor- 
porates a different approach to analyzing the 
graph when structural peculiarities are present. 


Both of the "reduction rules" algorithms
begin by applying these rules until either a
singleton node is produced or no further reduction
can be done. If a singleton is produced, both
algorithms terminate and return the result that
the control structure is well behaved. If, on
the other hand, no further reductions are possible,
there are two possibilities.


The first "reduction rules" algorithm applies 
the direct algorithm to the remainder of the con- 
trol structure at this point. The second algo- 
rithm performs local transformations on the remain- 
ing control structures and proceeds, attempting 
to reduce still further. 
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